Press release
May 12, 2026

jina-embeddings-v5-omni: Embeddings for Text, Image, Audio and Video

One model, four modalities: text, image, audio, video. Best-in-class omni embeddings in 1.6B and 0.9B.
Han Xiao • 7 minutes read
jina-embeddings-v5-omni - a jinaai Collection
Multimodal (text + image + video + audio) embedding models aligned with jina-embeddings-v5-text-*. Two sizes, four task variants each.
jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition
In this work, we introduce frozen-encoder model composition, a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. Our method is to extend the two Jina Embeddings v5 Text models to support additional media by adding encoders for images and audio. The backbone text embedding models and the added non-text media encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that this approach produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.
arXiv.org · Florian Hönicke

We are releasing jina-embeddings-v5-omni, extending our v5-text embedding models to images, audio, and video. Both models share the same frozen text backbone as v5-text, meaning text embeddings are identical - no index rebuild needed. jina-embeddings-v5-omni-small scores 53.93 on average across four modalities, matching LCO-7B (54.43) with 5.7x fewer parameters, while jina-embeddings-v5-omni-nano delivers competitive document retrieval at just 0.95B parameters.

Pareto frontier
Pareto frontier of all open-weight omni embedding models (supporting text, image, audio, and video). jina-embeddings-v5-omni-small (1.57B) matches the average score of LCO-7B (8.93B) while using 5.7x fewer parameters. jina-embeddings-v5-omni-nano (0.95B) outperforms LanguageBind (1.14B) by +8.9 points. Baselines: LanguageBind, Omni-Embed-Nemotron-3B, LCO-Embedding-Omni-3B, LCO-Embedding-Omni-7B.
Per-modality scores
Per-modality breakdown across Text (MMTEB), Image (MIEB), Video (MMEB-Video), and Audio (MAEB). jina-embeddings-v5-omni-small leads all omni models on text with 67.0, inheriting jina-embeddings-v5-text-small's full quality. On image (56.05), it excels at classification (68.55) and clustering (84.57, best among all models). Audio (51.46) is close to LCO-7B (52.37), with the best audio classification score (55.89). Video (41.20) is the current gap vs LCO-7B (47.41), as temporal reasoning benefits more from end-to-end training.
Task breakdown
Per-task performance across 13 task types. Gold stars mark tasks where jina-embeddings-v5-omni-small beats the best open-weight baseline (3-9x larger). Wins: image classification (68.55 vs 64.30), image clustering (84.57 vs 83.24), audio classification (55.89 vs 53.39). Main gaps: video retrieval (27.82 vs 58.73) and compositional/VQA (44.23 vs 53.40).
Document retrieval
Document retrieval (ViDoRe-in-MIEB). jina-embeddings-v5-omni-small at 0.92B active text+image parameters scores 79.08, outperforming LCO-3B (78.24 at 4.07B). jina-embeddings-v5-omni-nano scores 70.05 with just 0.31B active parameters, far above LanguageBind (37.33). Nemotron-3B leads at 85.64 but uses 5.1x more parameters.

Architecture

v5-omni keeps the v5-text backbone completely frozen and adds pretrained vision and audio encoders connected through small trainable projectors:

  • Vision: Qwen3.5 vision encoders (adapted from SigLIP2) with 2x2 spatial merge (4x token reduction). We freeze everything except the final projection layer (fc_vision_2), which we replace with a randomly initialized layer mapping into the text backbone's hidden dimension.
  • Audio: Qwen2.5-Omni encoder (adapted from Whisper-large-v3). A single randomly initialized fc_audio layer projects the 1280-dimensional output into the text backbone.
  • Video: Handled as a sequence of visual frames, optionally preceded by an extracted audio segment.
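The 2x2 spatial merge in the vision path can be sketched in a few lines of NumPy. The dimensions here are illustrative (a SigLIP-like 1152-wide ViT, a 1024-dim text hidden size), and the helper name is ours; only the mechanism - concatenate each 2x2 block of patch tokens, then project with the replaced fc_vision_2 layer - comes from the description above:

```python
import numpy as np

def spatial_merge_2x2(patch_tokens, grid_h, grid_w):
    """Merge every 2x2 block of patch tokens into one token by
    concatenating the four feature vectors: 4x fewer tokens, 4x wider."""
    d = patch_tokens.shape[-1]
    grid = patch_tokens.reshape(grid_h, grid_w, d)
    blocks = grid.reshape(grid_h // 2, 2, grid_w // 2, 2, d)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, 4 * d)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16 * 16, 1152))   # a 16x16 grid of ViT patches
merged = spatial_merge_2x2(tokens, 16, 16)      # -> (64, 4608)

# Stand-in for the trainable fc_vision_2: map 4*d into the text hidden size.
fc_vision_2 = rng.standard_normal((4 * 1152, 1024)) * 0.02
projected = merged @ fc_vision_2                # -> (64, 1024)
```

The merge is purely a reshape; the only learned piece in this path is the final projection, which is what makes projector-only training so cheap.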

The model inherits v5-text's four task-specific LoRA adapters (retrieval, text-matching, classification, clustering) and trains separate projector weights for each task variant. The architecture is fully modular: text-only deployment loads no vision or audio weights (identical to v5-text footprint), image-only skips audio, full omni loads everything.

Architecture
v5-omni architecture. Frozen vision and audio encoders feed trainable projectors into the frozen text backbone. Only the projectors (0.35% of total weights) are trained. Task-specific LoRA adapters handle retrieval, classification, clustering, and text-matching.
| Feature | jina-embeddings-v5-omni-small | jina-embeddings-v5-omni-nano |
|---|---|---|
| Base Text Model | jina-embeddings-v5-text-small (Qwen3-0.6B) | jina-embeddings-v5-text-nano (EuroBERT-210m) |
| Total Parameters | ~1.56B | ~1.04B |
| Modalities | Text, Image, Audio, Video, PDF | Text, Image, Audio, Video, PDF |
| Embedding Dimensions | 1024 | 768 |
| Matryoshka Dimensions | 32, 64, 128, 256, 512, 768, 1024 | 32, 64, 128, 256, 512, 768 |
| Max Sequence Length | 32768 tokens | 8192 tokens |
| Vision Encoder | Qwen3.5-2B ViT (SigLIP2) | SigLIP2 Base |
| Audio Encoder | Whisper-large-v3 | Whisper-large-v3 |
| Tasks | retrieval, text-matching, classification, clustering | retrieval, text-matching, classification, clustering |
| Text Compatibility | Identical to jina-embeddings-v5-text-small | Identical to jina-embeddings-v5-text-nano |
| Trainable Parameters | ~18M projectors (0.35%) | ~7M projectors (0.35%) |
| Pooling | Last-token | Last-token |
| License | CC BY-NC 4.0 | CC BY-NC 4.0 |

Getting Started

Elasticsearch (Elastic Inference Service)

If you are already using jina-embeddings-v5-text in Elasticsearch, your existing text indexes work with v5-omni out of the box. The omni models produce identical embeddings for text inputs as v5-text - same input, same vector, byte-for-byte. You do not need to re-embed or rebuild any text index. To start searching images, audio, and video alongside your existing text data, simply create a new index with v5-omni and ingest your multimodal content into it.

Create a semantic_text index with v5-omni as the inference endpoint. EIS automatically selects the correct LoRA adapter for indexing and retrieval:

PUT multimodal-semantic-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": ".jina-embeddings-v5-omni-small"
      }
    }
  }
}

Ingest text, images (as base64 data URIs), audio, and video into the same field of the same index:

// Ingest text
POST multimodal-semantic-index/_doc
{
  "content": "'Kraft Dinner' is what Canadians call macaroni and cheese when prepared from a kit."
}

// Ingest an image (base64)
POST multimodal-semantic-index/_doc
{
  "content": "data:image/png;base64,iVBORw0KGgoAAAAN..."
}

Search across all modalities with a single text query:

GET multimodal-semantic-index/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "Was bedeutet 'Kraft Dinner' für Kanadier?"
    }
  }
}

Jina Embedding API

curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "jina-embeddings-v5-omni-small",
    "task": "retrieval.query",
    "dimensions": 1024,
    "input": ["What does this image show?"],
    "images": ["data:image/png;base64,..."]
  }'
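The same request is easy to issue from Python. The field names (`model`, `task`, `dimensions`, `input`, `images`) mirror the curl call above; the helper function, its defaults, and the use of `requests` are our own illustrative choices, not an official client:

```python
import json

def build_embedding_request(texts, images=None,
                            model="jina-embeddings-v5-omni-small",
                            task="retrieval.query", dimensions=1024):
    """Assemble the JSON body for POST https://api.jina.ai/v1/embeddings,
    mirroring the curl example above."""
    body = {"model": model, "task": task, "dimensions": dimensions,
            "input": texts}
    if images:
        body["images"] = images
    return body

payload = json.dumps(build_embedding_request(
    ["What does this image show?"],
    images=["data:image/png;base64,..."],
))
# Send with any HTTP client, e.g.:
#   requests.post("https://api.jina.ai/v1/embeddings", data=payload,
#                 headers={"Content-Type": "application/json",
#                          "Authorization": "Bearer YOUR_API_KEY"})
```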

Hugging Face

from sentence_transformers import SentenceTransformer
import torch

model = SentenceTransformer(
    "jinaai/jina-embeddings-v5-omni-small-retrieval",
    model_kwargs={"dtype": torch.bfloat16},
)

# Text embedding (identical to v5-text)
text_emb = model.encode("What is knowledge distillation?", prompt_name="query")

# Image embedding
from PIL import Image
img = Image.open("photo.jpg")
img_emb = model.encode(img)

# Cross-modal similarity
similarity = model.similarity(text_emb, img_emb)

Training

The core idea is frozen-encoder model composition: take a strong text embedding model, add pretrained vision and audio encoders, connect them with small trainable projectors, and freeze everything except those projectors. Only 0.35% of total weights are trained, which gives us three properties: (1) text identity preservation - the backbone is unmodified, same input produces identical output; (2) training efficiency - projector-only training is 1.8-3.9x faster with 42-64% less GPU memory; (3) modularity - towers can be loaded independently.
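The frozen-tower mechanics can be reduced to a toy: features from a frozen "encoder" are aligned to a frozen target space by updating only a linear projector. This NumPy sketch uses a least-squares objective and shrunken dimensions purely for illustration - the real fc_audio maps 1280-dim Whisper features into the backbone's hidden size, and v5-omni is trained with task-specific objectives, not MSE:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the frozen towers (dimensions shrunk for illustration).
X = rng.standard_normal((256, 128))            # frozen audio-encoder features
true_map = rng.standard_normal((128, 64)) * 0.1
Y = X @ true_map                               # frozen text-space targets

# The only trainable piece: the linear projector.
W = np.zeros((128, 64))
lr, losses = 0.4, []
for _ in range(300):
    err = X @ W - Y
    losses.append(float((err ** 2).mean()))
    W -= lr * (X.T @ err) / len(X)             # plain gradient step on MSE

# Only W moved; the "encoder" producing X and the "backbone" defining Y
# never changed, yet the alignment loss drops by orders of magnitude.
```

Because gradients and optimizer state exist only for the projector, memory and compute scale with 0.35% of the model rather than all of it - which is where the 1.8-3.9x speedups and 42-64% memory savings come from.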

Training efficiency
Projector-only training vs full training on 4x H100 GPUs (batch size 256, 15K steps). Audio projector training is particularly efficient: 3.2x faster for small (154 min vs 497 min) and 3.9x faster for nano (112 min vs 441 min). Memory savings of 42-64% come from not storing gradients and optimizer states for frozen encoders.

v5-omni inherits Matryoshka dimension support from v5-text. Image and audio embeddings preserve most quality under truncation, while video degrades more at small dimensions.
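Using Matryoshka dimensions requires no special API: truncate the embedding to a supported prefix length and re-normalize. A minimal sketch with random vectors standing in for real embeddings (the helper name is ours):

```python
import numpy as np

def truncate_matryoshka(emb, dim):
    """Keep the first `dim` components and re-normalize (supported dims
    for omni-small: 32, 64, 128, 256, 512, 768, 1024)."""
    t = np.asarray(emb)[..., :dim]
    return t / np.linalg.norm(t, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = truncate_matryoshka(rng.standard_normal(1024), 1024)   # unit "query"
d = truncate_matryoshka(q + 0.5 * rng.standard_normal(1024), 1024)

# Cosine similarity stays usable as the dimension shrinks.
sims = {dim: float(truncate_matryoshka(q, dim) @ truncate_matryoshka(d, dim))
        for dim in (1024, 256, 64)}
```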

Radar summary
Summary: per-modality profile of v5-omni vs the strongest baselines. jina-embeddings-v5-omni-small at 1.57B covers text, image, and audio competitively, with video as the remaining gap to close.

Conclusion

The conventional wisdom says multimodal embeddings require training the entire model end-to-end. We disagree. v5-omni freezes the text backbone, trains 0.35% of weights, and matches models 5-7x its size. The lesson: composition beats retraining. A strong text encoder is the hardest part – once you have it, bolting on vision and audio via lightweight projectors is almost free.

This matters for production. Your existing v5-text indexes are untouched. Same query, same vector, byte-for-byte. You just gained image, audio, and video search without re-embedding a single document. That is the real unlock: multimodal retrieval as a drop-in upgrade, not a migration project.

jina-embeddings-v5-omni-small is the best-performing open-weight omni embedding model under 2B parameters. jina-embeddings-v5-omni-nano does it at 0.9B. Both are available now on Hugging Face, via the Jina Search Foundation API, and as a native inference endpoint in Elasticsearch.
