

We are releasing jina-embeddings-v5-text, the fifth generation of our embedding model family, pushing the quality-efficiency frontier for sub-1B multilingual embeddings:
- jina-embeddings-v5-text-small (677M parameters): 67.0 on MMTEB, 71.7 on MTEB English
- jina-embeddings-v5-text-nano (239M parameters): 65.5 on MMTEB, 71.0 on MTEB English
Both models support 32K token context, 4 task-specific LoRA adapters (retrieval, text-matching, classification, clustering), and Matryoshka dimension truncation from 1024 down to 32. At 239M parameters, the nano model matches the retrieval quality of models with twice as many parameters.
Compared to our previous generations, v5-text-small matches jina-embeddings-v4 (3.8B) on retrieval while being 5.6x smaller, and it outperforms jina-embeddings-v3 (572M) across all tasks at a similar parameter count.
| Feature | v5-text-small | v5-text-nano |
|---|---|---|
| Base Model | Qwen3-0.6B-Base | EuroBERT-210m |
| Parameters | 677M | 239M |
| Embedding Dimensions | 1024 | 768 |
| Context Length | 32,768 | 32,768 |
| Languages | 119 (Qwen3 tokenizer) | 15+ (EuroBERT tokenizer) |
| Pooling | Last-token | Last-token |
| LoRA Adapters | 4 (retrieval, text-matching, classification, clustering) | 4 (retrieval, text-matching, classification, clustering) |
| Matryoshka Dims | 32-1024 | 32-768 |
| MMTEB Score | 67.0 | 65.5 |
| MTEB English | 71.7 | 71.0 |
| License | CC BY-NC 4.0 | CC BY-NC 4.0 |

v5-text-small achieves 67.0 on MMTEB (averaged across 131 tasks in 9 task types), outperforming the next best sub-1B model (Qwen3-0.6B with instructions, at 64.3) by 2.7 points. The nano model, at 239M parameters, scores 65.5, beating models with twice as many parameters.
On MTEB English, v5-text-small leads all sub-1B multilingual models with 71.7 (averaged across 41 tasks in 7 task types), followed closely by KaLM-mini-v2.5 (71.3) and v5-text-nano (71.0). The 239M nano model achieves parity with the 494M KaLM-mini-v2.5 at less than half its size.
v5-text-small achieves the highest task-level average (63.28) across five retrieval benchmarks (MTEB Multilingual, MTEB English, RTEB, BEIR, and LongEmbed) among sub-1B models, nearly matching jina-embeddings-v4 (3.8B, 63.62) while being 5.6x smaller.

## Architecture

v5-text builds on compact backbones (Qwen3-0.6B-Base for small, EuroBERT-210m for nano) and uses last-token pooling instead of mean pooling. Four lightweight LoRA adapters are injected at each transformer layer, handling retrieval, text-matching, classification, and clustering independently; users select the appropriate adapter at inference time. For retrieval, queries are prefixed with "Query:" and documents with "Document:". Context length is 32K tokens, a 4x increase over v3.
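A minimal sketch of last-token pooling, assuming a Hugging Face-style model that returns per-token hidden states and a right-padded attention mask (illustrative only, not the production pooling code):

```python
import torch
import torch.nn.functional as F

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select the hidden state of the last non-padding token for each sequence.

    hidden_states: (batch, seq_len, dim); attention_mask: (batch, seq_len), 1 for real tokens.
    Assumes right padding, so the last real token sits at index sum(mask) - 1.
    """
    last_idx = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    emb = hidden_states[batch_idx, last_idx]  # (batch, dim)
    # L2-normalize so cosine similarity reduces to a dot product.
    return F.normalize(emb, p=2, dim=-1)
```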
## Getting Started
### Elastic Inference Service
The fastest way to use v5-text in production. Elastic Inference Service (EIS) provides managed embedding inference with built-in scaling, so you can generate embeddings directly within your Elastic deployment without managing infrastructure.
PUT _inference/text_embedding/jina-v5
{
  "service": "elastic",
  "service_settings": {
    "model_id": "jina-embeddings-v5-text-small"
  }
}
See the EIS documentation for setup details.
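Once the endpoint is created, you can call it from any Elasticsearch client. A minimal sketch against the standard `_inference` API using `requests`; the deployment URL and API key below are placeholders:

```python
import requests

ES_URL = "https://YOUR_DEPLOYMENT.es.example.com"  # placeholder Elasticsearch endpoint
headers = {"Authorization": "ApiKey YOUR_ES_API_KEY", "Content-Type": "application/json"}

resp = requests.post(
    f"{ES_URL}/_inference/text_embedding/jina-v5",
    headers=headers,
    json={"input": ["What is knowledge distillation?"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # text_embedding results, one vector per input string
```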
### Jina Embedding API
Our hosted API with pay-per-token pricing. Supports task selection, dimension truncation, and batch processing out of the box. No GPU required.
curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "jina-embeddings-v5-text-small",
    "task": "retrieval.query",
    "dimensions": 1024,
    "input": ["What is knowledge distillation?"]
  }'
Get an API key at jina.ai/embeddings.
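The same endpoint handles the document side and Matryoshka truncation. A sketch in Python via `requests`; the `retrieval.passage` task name follows the convention of earlier API generations and is an assumption for v5:

```python
import requests

resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "jina-embeddings-v5-text-small",
        "task": "retrieval.passage",  # assumed document-side task name
        "dimensions": 256,            # Matryoshka truncation: any value from 32 to 1024
        "input": ["Knowledge distillation transfers knowledge from a large model to a smaller one."],
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # 256
```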
### Hugging Face + sentence-transformers
Run locally with full control over inference. Weights are available on Hugging Face with out-of-the-box sentence-transformers integration.
from sentence_transformers import SentenceTransformer
import torch
model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},
)
query_emb = model.encode("What is knowledge distillation?", prompt_name="query")
doc_embs = model.encode(["Knowledge distillation transfers...", "Venus is..."], prompt_name="document")
similarity = model.similarity(query_emb, doc_embs)
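Matryoshka truncation also works locally. Continuing the snippet above, sentence-transformers can truncate at load time via `truncate_dim`; the 256-dimension setting here is just an example:

```python
# Keep only the first 256 Matryoshka dimensions of every embedding.
model_256 = SentenceTransformer(
    "jinaai/jina-embeddings-v5-text-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},
    truncate_dim=256,
)
query_emb = model_256.encode("What is knowledge distillation?", prompt_name="query")
print(query_emb.shape)  # (256,)
```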
### vLLM
High-throughput serving for production workloads. vLLM supports v5-text natively with last-token pooling.
from vllm import LLM
from vllm.config.pooler import PoolerConfig
model = LLM(
    model="jinaai/jina-embeddings-v5-text-small-retrieval",
    dtype="float16",
    runner="pooling",
    pooler_config=PoolerConfig(seq_pooling_type="LAST", normalize=True),
)
outputs = model.encode(["Query: climate change impacts"], pooling_task="embed")
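To turn the pooled outputs into scores, read the embedding from each request output and take a dot product (the vectors are already normalized by the pooler config above). The `outputs.embedding` attribute is an assumption about vLLM's embedding output interface and may differ across versions:

```python
import numpy as np

# Continuing the snippet above: `model` is the LLM instance with the pooling runner.
query_out = model.encode(["Query: climate change impacts"], pooling_task="embed")
doc_out = model.encode(["Document: Rising sea levels threaten coastal cities."], pooling_task="embed")

q = np.array(query_out[0].outputs.embedding)  # assumed output attribute
d = np.array(doc_out[0].outputs.embedding)
print(float(q @ d))  # cosine similarity via dot product on normalized vectors
```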
For optimized local inference via llama.cpp and MLX, each task adapter's LoRA weights are merged into the base model to produce standalone weight files. This is why you see separate repositories per task (retrieval, text-matching, classification, clustering) -- each contains the full merged weights ready for direct loading, with no LoRA overhead at inference time.
### llama.cpp (GGUF)
Run quantized models on CPU or edge devices. We provide 14 GGUF quantization variants for each model, from F16 down to IQ1_S.
llama-server -hf jinaai/jina-embeddings-v5-text-small-retrieval-GGUF:Q4_K_M \
  --embedding --pooling last -ub 32768
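With --embedding enabled, llama-server exposes an OpenAI-compatible embeddings endpoint. A sketch against the default local port (8080; adjust if you passed --port):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/embeddings",
    json={
        "model": "jina-embeddings-v5-text-small-retrieval",  # informational for llama-server
        "input": ["Query: What is knowledge distillation?"],
    },
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))
```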
### MLX
Native Apple Silicon inference via MLX. Available in full precision, 4-bit, and 8-bit quantization for all task adapters.
import mlx.core as mx
from tokenizers import Tokenizer
from model import JinaEmbeddingModel
import json
with open("config.json") as f:
    config = json.load(f)
model = JinaEmbeddingModel(config)
weights = mx.load("model-4bit.safetensors") # or model.safetensors, model-8bit.safetensors
model.load_weights(list(weights.items()))
tokenizer = Tokenizer.from_file("tokenizer.json")
texts = ["Query: What is machine learning?"]
embeddings = model.encode(texts, tokenizer)
Download from Hugging Face: jinaai/jina-embeddings-v5-text-small-retrieval-mlx (also available for text-matching, classification, and clustering adapters).
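Scoring works the same way as with the other runtimes. The sketch below continues the snippet above and assumes `model.encode` returns an `mx.array` of shape (n, dim); it normalizes before taking dot products:

```python
# Cosine similarity between a query and candidate documents on Apple Silicon.
def normalize(x: mx.array) -> mx.array:
    return x / mx.linalg.norm(x, axis=-1, keepdims=True)

query_emb = normalize(model.encode(["Query: What is machine learning?"], tokenizer))
doc_embs = normalize(model.encode(
    ["Document: Machine learning is a subfield of AI.", "Document: Venus is the second planet."],
    tokenizer,
))
print(query_emb @ doc_embs.T)  # one similarity score per document
```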
## Training
Both models are distilled from Qwen3-Embedding-4B, a much larger trained embedding model. The small variant uses Qwen3-0.6B-Base as its backbone, while nano uses EuroBERT-210m. Our training combines two complementary signals:
- Embedding distillation from the 4B teacher via cosine similarity loss. The student learns to approximate the teacher's embedding space without requiring instruction-style prompts. This is particularly effective for languages and tasks where labeled data is scarce.
- Task-specific contrastive loss (InfoNCE) on labeled query-document pairs with hard negative mining and in-batch negatives. After freezing the distilled backbone, we train separate LoRA adapters for each task category (see the sketch below).
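A compact sketch of how the two signals can be combined per batch; the temperature, the equal weighting of the two terms, and the absence of explicit hard-negative handling are illustrative assumptions, not the exact training configuration:

```python
import torch
import torch.nn.functional as F

def combined_loss(student_emb, teacher_emb, query_emb, doc_emb, temperature=0.05, alpha=0.5):
    """Embedding distillation (cosine) plus InfoNCE with in-batch negatives.

    student_emb / teacher_emb: (batch, dim) embeddings of the same texts.
    query_emb / doc_emb: (batch, dim) paired queries and their positive documents.
    """
    # Distillation: pull student vectors toward the teacher's directions.
    distill = 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()

    # InfoNCE: each query's positive is its own document; other in-batch docs act as negatives.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = (q @ d.T) / temperature               # (batch, batch) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    contrastive = F.cross_entropy(logits, labels)

    return alpha * distill + (1.0 - alpha) * contrastive
```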
Our ablation studies show this combined approach consistently outperforms either method alone. On MTEB English retrieval, the combined method achieves 60.1 nDCG@10 vs 58.6 for distillation-only and 54.3 for contrastive-only on the same backbone.
We also apply GOR (Generalized Orthogonal Regularization) during training, which encourages embedding components to be more uniformly distributed. This does not dramatically improve standard benchmark scores, but it makes binary quantization nearly lossless, a critical property for memory-constrained deployment.
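To see why near-lossless binary quantization matters for deployment, here is a generic sign-based binarization and Hamming-distance search sketch; this is the standard recipe, not the exact pipeline used to evaluate GOR:

```python
import numpy as np

def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Quantize float embeddings to packed bits: 1 bit per dimension (1024 dims -> 128 bytes)."""
    return np.packbits((embeddings > 0).astype(np.uint8), axis=-1)

def hamming_search(query_bits: np.ndarray, doc_bits: np.ndarray, top_k: int = 10) -> np.ndarray:
    """Rank documents by Hamming distance to the query (lower is closer)."""
    xor = np.bitwise_xor(doc_bits, query_bits)       # (n_docs, n_bytes)
    dist = np.unpackbits(xor, axis=-1).sum(axis=-1)  # popcount per document
    return np.argsort(dist)[:top_k]
```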
A few observations from training worth noting:
- Distillation and contrastive learning are complementary in ways we did not initially expect.
- Removing any single component from our loss mixture degrades performance across the board.
- Task-specific LoRA adapters outperform multi-task training at negligible parameter overhead.
- GOR regularization makes binary quantization nearly lossless, which matters more for deployment than marginal full-precision gains.
## Conclusion
Embedding models are increasingly used as tool-chain components inside larger systems. LLM agents call embedding APIs for retrieval, memory, and classification as part of agentic workflows. Projects like OpenClaw and OpenViking treat embeddings as a core infrastructure layer for agent context management, not as standalone search endpoints. In this regime, inference cost and latency per call matter as much as benchmark scores, and compact models become the natural choice.
The trend toward smaller embedding models reflects a broader shift. On-device retrieval, browser-based search, and edge deployment all demand models that fit in constrained memory budgets. Matryoshka dimension support lets a single model serve both high-precision and ultra-fast approximate search without retraining. Combined with GGUF quantization down to 1-2 bits, the effective memory footprint for a production embedding service drops by an order of magnitude.
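The order-of-magnitude claim is easy to check with back-of-the-envelope numbers; the 10M-vector corpus below is illustrative, not a benchmark figure:

```python
n_vectors = 10_000_000
float32_full = n_vectors * 1024 * 4  # 1024-dim float32 vectors: ~41.0 GB
binary_full = n_vectors * 1024 // 8  # same dims at 1 bit each: ~1.3 GB, a 32x reduction
print(float32_full / 1e9, binary_full / 1e9)
# Truncating to fewer Matryoshka dimensions before binarizing shrinks this further.
```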
We are working on jina-embeddings-v5-multimodal, extending the same architecture to vision and cross-modal retrieval. Early results suggest that aligning a vision encoder with a fine-tuned text embedding model is possible without degrading text performance. Stay tuned.






