

We are releasing jina-embeddings-v5-omni, extending our v5-text embedding models to images, audio, and video in two sizes, small and nano. Both share the same frozen text backbone as v5-text, so text embeddings are identical and no index rebuild is needed. jina-embeddings-v5-omni-small scores 53.93 on average across four modalities, matching LCO-7B (54.43) with 5.7x fewer parameters, while jina-embeddings-v5-omni-nano delivers competitive document retrieval at just 0.95B parameters.




tagArchitecture
v5-omni keeps the v5-text backbone completely frozen and adds pretrained vision and audio encoders connected through small trainable projectors:
- Vision: Qwen3.5 vision encoders (adapted from SigLIP2) with 2x2 spatial merge (4x token reduction). We freeze everything except the final projection layer (`fc_vision_2`), which we replace with a randomly initialized layer mapping into the text backbone's hidden dimension.
- Audio: Qwen2.5-Omni encoder (adapted from Whisper-large-v3). A single randomly initialized `fc_audio` layer projects the 1280-dimensional output into the text backbone.
- Video: Handled as a sequence of visual frames, optionally preceded by an extracted audio segment.
The model inherits v5-text's four task-specific LoRA adapters (retrieval, text-matching, classification, clustering) and trains separate projector weights for each task variant. The architecture is fully modular: text-only deployment loads no vision or audio weights (identical to v5-text footprint), image-only skips audio, full omni loads everything.
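As a concrete illustration, here is a minimal PyTorch sketch of this composition. The projector names follow the description above; the class name, tower arguments, and dimension defaults are illustrative placeholders, not the actual model internals:

```python
import torch.nn as nn

class OmniEmbedder(nn.Module):
    """Sketch of frozen-encoder composition (names and dims are assumptions)."""

    def __init__(self, text_backbone: nn.Module, vision_encoder: nn.Module,
                 audio_encoder: nn.Module, vision_dim: int = 1152,
                 audio_dim: int = 1280, hidden_dim: int = 1024):
        super().__init__()
        self.text_backbone = text_backbone    # frozen: text path identical to v5-text
        self.vision_encoder = vision_encoder  # frozen pretrained tower
        self.audio_encoder = audio_encoder    # frozen pretrained tower
        # The only trainable pieces: randomly initialized projectors mapping
        # each encoder's output into the text backbone's hidden dimension.
        self.fc_vision_2 = nn.Linear(vision_dim, hidden_dim)
        self.fc_audio = nn.Linear(audio_dim, hidden_dim)
        # Freeze every tower; only the two projectors keep requires_grad=True.
        for tower in (self.text_backbone, self.vision_encoder, self.audio_encoder):
            for p in tower.parameters():
                p.requires_grad = False
```

Because the text path never passes through a trainable weight, dropping the vision and audio towers recovers exactly the v5-text footprint.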

| Feature | jina-embeddings-v5-omni-small | jina-embeddings-v5-omni-nano |
|---|---|---|
| Base Text Model | jina-embeddings-v5-text-small (Qwen3-0.6B) | jina-embeddings-v5-text-nano (EuroBERT-210m) |
| Total Parameters | ~1.56B | ~1.04B |
| Modalities | Text, Image, Audio, Video, PDF | Text, Image, Audio, Video, PDF |
| Embedding Dimensions | 1024 | 768 |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 | 32, 64, 128, 256, 512, 768 |
| Max Sequence Length | 32768 tokens | 8192 tokens |
| Vision Encoder | Qwen3.5-2B ViT (SigLIP2) | SigLIP2 Base |
| Audio Encoder | Whisper-large-v3 | Whisper-large-v3 |
| Tasks | retrieval, text-matching, classification, clustering | retrieval, text-matching, classification, clustering |
| Text Compatibility | Identical to jina-embeddings-v5-text-small | Identical to jina-embeddings-v5-text-nano |
| Trainable Parameters | ~18M projectors (0.35%) | ~7M projectors (0.35%) |
| Pooling | Last-token | Last-token |
| License | CC BY-NC 4.0 | CC BY-NC 4.0 |
tagGetting Started
tagElasticsearch (Elastic Inference Service)
If you are already using jina-embeddings-v5-text in Elasticsearch, your existing text indexes work with v5-omni out of the box. For text inputs, the omni models produce embeddings identical to v5-text's: same input, same vector, byte-for-byte. You do not need to re-embed or rebuild any text index. To start searching images, audio, and video alongside your existing text data, simply create a new index with v5-omni and ingest your multimodal content into it.
Create a semantic_text index with v5-omni as the inference endpoint. EIS automatically selects the correct LoRA adapter for indexing and retrieval:
```
PUT multimodal-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v5-omni-small"
      }
    }
  }
}
```
Ingest text, images (as base64 data URIs), audio, and video into the same field, the same index:
```
// Ingest text
POST multimodal-semantic-index/_doc
{
  "content": "'Kraft Dinner' is what Canadians call macaroni and cheese when prepared from a kit."
}

// Ingest an image (base64)
POST multimodal-semantic-index/_doc
{
  "content": "data:image/png;base64,iVBORw0KGgoAAAAN..."
}
```
Search across all modalities with a single text query:
```
GET multimodal-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "Was bedeutet 'Kraft Dinner' für Kanadier?"
    }
  }
}
```
The German query ("What does 'Kraft Dinner' mean to Canadians?") matches the English document ingested above: retrieval is cross-lingual out of the box.
tagJina Embedding API
```bash
curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "jina-embeddings-v5-omni-small",
    "task": "retrieval.query",
    "dimensions": 1024,
    "input": ["What does this image show?"],
    "images": ["data:image/png;base64,..."]
  }'
```
tagHugging Face
```python
from PIL import Image
from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},
)

# Text embedding (identical to v5-text)
text_emb = model.encode("What is knowledge distillation?", prompt_name="query")

# Image embedding
img = Image.open("photo.jpg")
img_emb = model.encode(img)

# Cross-modal similarity
similarity = model.similarity(text_emb, img_emb)
```
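Since the text path is byte-for-byte identical, you can sanity-check compatibility yourself. A minimal sketch, assuming the v5-text checkpoint follows the same repository naming pattern:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

omni = SentenceTransformer("jinaai/jina-embeddings-v5-omni-small-retrieval")
v5_text = SentenceTransformer("jinaai/jina-embeddings-v5-text-small-retrieval")  # assumed repo id

query = "What is knowledge distillation?"
# Frozen backbone: the omni model must reproduce v5-text exactly on text.
assert np.array_equal(
    omni.encode(query, prompt_name="query"),
    v5_text.encode(query, prompt_name="query"),
)
```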
tagTraining
The core idea is frozen-encoder model composition: take a strong text embedding model, add pretrained vision and audio encoders, connect them with small trainable projectors, and freeze everything except those projectors. Only 0.35% of total weights are trained, which gives us three properties: (1) text identity preservation - the backbone is unmodified, so the same input produces identical output; (2) training efficiency - projector-only training is 1.8-3.9x faster with 42-64% less GPU memory; (3) modularity - the towers can be loaded independently.
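In sketch form (reusing the hypothetical OmniEmbedder from the architecture section, with the towers assumed already loaded), projector-only training means the optimizer's parameter list contains nothing but the two projectors; gradients are never computed for the frozen towers, which is where the speed and memory savings come from. The hyperparameters below are placeholders, not the actual recipe:

```python
import torch

# Only the projectors require grad, so only they go to the optimizer.
model = OmniEmbedder(text_backbone, vision_encoder, audio_encoder)
trainable = [p for p in model.parameters() if p.requires_grad]
total = sum(p.numel() for p in model.parameters())
print(f"training {sum(p.numel() for p in trainable):,} of {total:,} parameters")

optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # lr is a placeholder
```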

v5-omni inherits Matryoshka dimension support from v5-text. Image and audio embeddings preserve most quality under truncation, while video degrades more at small dimensions.
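Truncating a Matryoshka embedding is just slicing off the tail and re-normalizing. A minimal sketch:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v5-omni-small-retrieval")
full = model.encode("What is knowledge distillation?", prompt_name="query")  # 1024 dims

# Keep the first k dimensions and re-normalize so cosine similarity stays
# well-defined; k should be one of the supported Matryoshka dimensions.
k = 128
compact = full[:k] / np.linalg.norm(full[:k])
```

Recent sentence-transformers versions can also do this at encode time via the truncate_dim constructor argument.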

tagConclusion
The conventional wisdom says multimodal embeddings require training the entire model end-to-end. We disagree. v5-omni freezes the text backbone, trains 0.35% of the weights, and matches models 5-7x its size. The lesson: composition beats retraining. A strong text encoder is the hardest part; once you have it, bolting on vision and audio via lightweight projectors is almost free.
This matters for production. Your existing v5-text indexes are untouched. Same query, same vector, byte-for-byte. You just gained image, audio, and video search without re-embedding a single document. That is the real unlock: multimodal retrieval as a drop-in upgrade, not a migration project.
jina-embeddings-v5-omni-small is the best-performing open-weight omni embedding model under 2B parameters; jina-embeddings-v5-omni-nano does the same at 0.95B. Both are available now on Hugging Face, via the Jina Search Foundation API, and as a native inference endpoint in Elasticsearch.






