jina-embeddings-v5-text: New SOTA Small Multilingual Embeddings

jina-embeddings-v5-text: Task-Targeted Embedding Distillation

Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32K tokens for small, 8K for nano) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.

arXiv.orgMohammad Kalim Akram

We are releasing jina-embeddings-v5-text, the fifth generation of our embedding model family, pushing the quality-efficiency frontier for sub-1B multilingual embeddings:

jina-embeddings-v5-text-small (677M parameters): 67.0 on MMTEB, 71.7 on MTEB English
jina-embeddings-v5-text-nano (239M parameters): 65.5 on MMTEB, 71.0 on MTEB English

The small model supports 32K token context (8K for nano), 4 task-specific LoRA adapters (retrieval, text-matching, classification, clustering), and Matryoshka dimension truncation from 1024 down to 32. At 239M parameters, the nano model matches the retrieval quality of models with twice as many parameters.

Compared to our previous generations: v5-text-small matches jina-embeddings-v4 (3.8B) on retrieval while being 5.6x smaller, and outperforms jina-embeddings-v3 (572M) across all tasks with a similar parameter count.

Feature	v5-text-small	v5-text-nano
Base Model	`Qwen3-0.6B-Base`	`EuroBERT-210m`
Parameters	677M	239M
Embedding Dimensions	1024	768
Context Length	32,768	8,192
Languages	119 (Qwen3 tokenizer)	15+ (EuroBERT tokenizer)
Pooling	Last-token	Last-token
LoRA Adapters	4 (retrieval, text-matching, classification, clustering)
Matryoshka Dims	32-1024	32-768
MMTEB Score	67.0	65.5
MTEB English	71.7	71.0
License	CC BY-NC 4.0

MMTEB Multilingual Benchmark — `v5-text-small` achieves **67.0** on MMTEB (averaged across 131 tasks in 9 task types), outperforming the next best sub-1B model (Qwen3-0.6B with instructions at 64.3) by **+2.7 points**. The nano model at 239M parameters scores 65.5, beating models with twice as many parameters.

MTEB English Benchmark — On English-only evaluation, `v5-text-small` leads all sub-1B multilingual models with **71.7** (averaged across 41 tasks in 7 task types), followed closely by `KaLM-mini-v2.5` (71.3) and `v5-text-nano` (71.0). The 239M nano model achieves parity with the 494M KaLM at less than half its size.

Retrieval Benchmark Results — `v5-text-small` achieves the highest task-level average (**63.28**) across five retrieval benchmarks (MTEB Multilingual, MTEB English, RTEB, BEIR, and LongEmbed) among sub-4B models, matching jina-embeddings-v4 (3.8B, 63.62) while being **5.6x smaller**.

Table showing multilingual MTEB model performance metrics, rankings, and evaluations across various tasks and language benchm — According to the MTEB Leaderboard on 2026/02/21, `jina-embeddings-v5-small` (0.6B params, rank #8) is the strongest embedding model under 1B parameters on MTEB Multilingual v2, outperforming Qwen3-Embedding-0.6b across every metric. `jina-embeddings-v5-nano` (0.2B params, rank #11) delivers top-11 performance at a fraction of the size, no other model in this parameter class comes close.

tagArchitecture

<code>jina-embeddings-v5-text</code> Architecture

v5-text uses decoder-only backbones with last-token pooling instead of mean pooling. Four lightweight LoRA adapters are injected at each transformer layer, handling retrieval, text-matching, classification, and clustering independently. Users select the appropriate adapter at inference time. For retrieval, queries get a "Query:" prefix and documents get "Document:". Context length is 32K tokens for small (8K for nano), a 4x increase over v3.

tagGetting Started

tagElastic Inference Service

The fastest way to use v5-text in production. Elastic Inference Service (EIS) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment without managing infrastructure.

PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}

See the EIS documentation for setup details.

tagJina Embedding API

Our hosted API with pay-per-token pricing. Supports task selection, dimension truncation, and batch processing out of the box. No GPU required.

curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "jina-embeddings-v5-text-small",
    "task": "retrieval.query",
    "dimensions": 1024,
    "input": ["What is knowledge distillation?"]
  }'

Get an API key at jina.ai/embeddings.

tagHugging Face + sentence-transformers

Run locally with full control over inference. Weights are available on Hugging Face with out-of-the-box sentence-transformers integration.

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},
)

query_emb = model.encode("What is knowledge distillation?", prompt_name="query")
doc_embs = model.encode(["Knowledge distillation transfers...", "Venus is..."], prompt_name="document")
similarity = model.similarity(query_emb, doc_embs)

tagvLLM

High-throughput serving for production workloads. vLLM supports v5-text natively with last-token pooling.

from vllm import LLM
from vllm.config.pooler import PoolerConfig

model = LLM(
    model="jinaai/jina-embeddings-v5-text-small-retrieval",
    dtype="float16",
    runner="pooling",
    pooler_config=PoolerConfig(seq_pooling_type="LAST", normalize=True),
)
outputs = model.encode(["Query: climate change impacts"], pooling_task="embed")

For optimized local inference via llama.cpp and MLX, each task adapter's LoRA weights are merged into the base model to produce standalone weight files. This is why you see separate repositories per task (retrieval, text-matching, classification, clustering) -- each contains the full merged weights ready for direct loading, with no LoRA overhead at inference time.

tagllama.cpp (GGUF)

Run quantized models on CPU or edge devices. We provide 14 GGUF quantization variants for each model, from F16 down to IQ1_S.

llama-server -hf jinaai/jina-embeddings-v5-text-small-retrieval-GGUF:Q4_K_M \
  --embedding --pooling last -ub 32768

tagMLX

Native Apple Silicon inference via MLX. Available in full precision, 4-bit, and 8-bit quantization for all task adapters.

import mlx.core as mx
from tokenizers import Tokenizer
from model import JinaEmbeddingModel
import json

with open("config.json") as f:
    config = json.load(f)

model = JinaEmbeddingModel(config)
weights = mx.load("model-4bit.safetensors")  # or model.safetensors, model-8bit.safetensors
model.load_weights(list(weights.items()))

tokenizer = Tokenizer.from_file("tokenizer.json")
texts = ["Query: What is machine learning?"]
embeddings = model.encode(texts, tokenizer)

Download from Hugging Face: jinaai/jina-embeddings-v5-text-small-retrieval-mlx (also available for text-matching, classification, and clustering adapters).

tagTraining

Both models are distilled from Qwen3-Embedding-4B, a much larger trained embedding model. The small variant uses Qwen3-0.6B-Base as its backbone, while nano uses EuroBERT-210m. Our training combines two complementary signals:

Embedding distillation from the 4B teacher via cosine similarity loss. The student learns to approximate the teacher's embedding space without requiring instruction-style prompts. This is particularly effective for languages and tasks where labeled data is scarce.
Task-specific contrastive loss (InfoNCE) on labeled query-document pairs with hard negative mining and in-batch negatives. After freezing the distilled backbone, we train separate LoRA adapters for each task category.

Our ablation studies show this combined approach consistently outperforms either method alone. On MTEB English retrieval, the combined method achieves 60.1 nDCG@10 vs 58.6 for distillation-only and 54.3 for contrastive-only on the same backbone.

We also apply GOR (Generalized Orthogonal Regularization) during training, which encourages embedding components to be more uniformly distributed. This does not dramatically improve standard benchmark scores, but it makes binary quantization nearly lossless, a critical property for memory-constrained deployment.

A few observations from training worth noting:

Distillation and contrastive learning are complementary in ways we did not initially expect.
Removing any single component from our loss mixture degrades performance across the board.
Task-specific LoRA adapters outperform multi-task training at negligible parameter overhead.
GOR regularization makes binary quantization nearly lossless, which matters more for deployment than marginal full-precision gains.

tagConclusion

Embedding models are increasingly used as tool-chain components inside larger systems. LLM agents call embedding APIs for retrieval, memory, and classification as part of agentic workflows. Projects like OpenClaw and OpenViking treat embeddings as a core infrastructure layer for agent context management, not as standalone search endpoints. In this regime, inference cost and latency per call matter as much as benchmark scores, and compact models become the natural choice.

The trend toward smaller embedding models reflects a broader shift. On-device retrieval, browser-based search, and edge deployment all demand models that fit in constrained memory budgets. Matryoshka dimension support lets a single model serve both high-precision and ultra-fast approximate search without retraining. Combined with GGUF quantization down to 1-2 bits, the effective memory footprint for a production embedding service drops by an order of magnitude.

We are working on jina-embeddings-v5-multimodal, extending the same architecture to vision and cross-modal retrieval. Early results suggest that aligning a vision encoder with a fine-tuned text embedding model is possible without degrading text performance. Stay tuned.