jina-embeddings-v5-omni-small

Multimodal embeddings for text, image, audio, video, and PDF
License: CC BY-NC 4.0
Release Date: 2026-05-07
Input: Text, Image, Audio, Video, PDF
Output: Vector
Matryoshka Dimensions: 32, 64, 128, 256, 512, 768, 1024
Late Chunking: No
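Matryoshka truncation means a full 1024-dimensional embedding can be cut down to any of the listed prefix lengths and re-normalized before cosine comparison. A minimal sketch in plain Python; the toy 8-dimensional vectors below are made up and merely stand in for real model outputs:

```python
import math

def truncate_matryoshka(embedding, dim):
    """Keep the first `dim` components and re-normalize to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 8-dim "embeddings" standing in for real 1024-dim outputs.
doc = [0.9, 0.1, 0.3, -0.2, 0.05, 0.0, 0.1, -0.1]
query = [0.8, 0.2, 0.25, -0.1, 0.0, 0.1, 0.05, -0.2]

full_sim = cosine(doc, query)
short_sim = cosine(truncate_matryoshka(doc, 4), truncate_matryoshka(query, 4))
```

Shorter prefixes trade a little ranking quality for much smaller index size; the comparison metric stays cosine similarity at every dimension.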
Model Details
Parameters: 1.7B
Input Token Length: 32K
Output Dimension: 1024
Base Model: jina-embeddings-v5-text-small
Trained Languages: 32
Supported Languages: 93
Quantizations: GGUF
Related Models
jina-embeddings-v5-omni-nano
jina-embeddings-v5-text-small
jina-embeddings-v3
jina-clip-v2
Supported Tasks: Retrieval, Text Matching, Clustering, Classification
Tags: multimodal-embedding, embeddings, multilingual, long-context, production, matryoshka, last-token-pooling, visual-document-retrieval
Available via: Elastic Inference Service, Jina API, Hugging Face
I/O graphs: four input/output diagrams, each pairing a task with one modality. Graph 1: Text or Image + Task → Vector. Graph 2: Text or Audio + Task → Vector. Graph 3: Text or Video + Task → Vector. Graph 4: Text or PDF + Task → multiple Vectors.
Publications (1)
arXiv, May 11, 2026: jina-embeddings-v5-omni: Text-Geometry-Preserving Multimodal Embeddings via Frozen-Tower Composition

Overview

jina-embeddings-v5-omni-small (~1.74B parameters) is a multimodal embedding model that accepts text, images, audio, video, and PDF documents and produces embeddings in a shared vector space aligned with jina-embeddings-v5-text-small. You can index with text and query with any modality, or vice versa, without reindexing. The text backbone and all four task-specific LoRA adapters (retrieval, text-matching, clustering, classification) are frozen during multimodal training, so text-only outputs are bit-identical to those of jina-embeddings-v5-text-small. The model produces 1024-dimensional embeddings with Matryoshka truncation down to 32 dimensions and supports a 32K-token context length.
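The "index with text, query with any modality" property can be illustrated with a pair of request bodies. The field names below follow the general convention of the existing Jina embeddings API and are assumptions, not a documented schema for this model; the URLs are placeholders:

```python
import json

# Hypothetical request bodies for a Jina-style embeddings endpoint. The
# `task` values and per-modality input keys are illustrative assumptions.
index_request = {
    "model": "jina-embeddings-v5-omni-small",
    "task": "retrieval.passage",  # assumed task name for document-side encoding
    "input": [{"text": "A tutorial on whale song analysis."}],
}

query_request = {
    "model": "jina-embeddings-v5-omni-small",
    "task": "retrieval.query",  # assumed task name for query-side encoding
    # Any modality can query a text-built index: here, an image URL.
    "input": [{"image": "https://example.com/spectrogram.png"}],
}

payload = json.dumps(query_request)
```

Because all modalities land in the same vector space, the index built from `index_request` needs no change when queries switch from text to images, audio, or video.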

Methods

The model is trained in a third stage extending jina-embeddings-v5-text-small. The text backbone and all four task-specific LoRA adapters are frozen; only the cross-modal projectors are newly trained. A SigLIP2 So400m vision encoder handles images and video (32 uniformly sampled frames), and a Whisper-large-v3 audio encoder handles audio input. PDF pages are rendered as images and processed through the vision pathway. Training uses a contrastive loss with cross-modal hard negatives to align visual and audio representations with the existing text embedding space.
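The core of this recipe, a contrastive loss pulling a projected multimodal embedding toward its matching frozen text embedding and away from hard negatives, can be shown with a toy InfoNCE computation. This is a didactic sketch with made-up 2-dimensional vectors, not the actual training code:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive, negatives, temperature=0.05):
    """InfoNCE: negative log-softmax score of the positive among all candidates.

    In v5-omni training the query side would be a projected image/audio/video
    embedding and the candidates frozen text-tower embeddings; here they are
    just toy vectors.
    """
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m) + math.log(denom)

q = [1.0, 0.0]                       # projected multimodal embedding
pos = [0.9, 0.1]                     # aligned frozen text embedding
negs = [[-0.5, 0.8], [0.0, -1.0]]    # cross-modal hard negatives
loss = info_nce(q, pos, negs)
```

Gradients from this loss would flow only into the projector parameters; the frozen text tower guarantees the shared space stays anchored to jina-embeddings-v5-text-small.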

Performance

Text-only performance is bit-identical to jina-embeddings-v5-text-small — the text backbone and LoRA adapters are untouched during multimodal training. On cross-modal retrieval, the model demonstrates strong alignment across text-image, text-audio, and text-video tasks. PDF page retrieval is handled through the vision pathway. The omni-small model offers the best accuracy-efficiency tradeoff among Jina multimodal embedding models for server deployment.

Best Practice

The model uses the same four LoRA adapters as jina-embeddings-v5-text-small: retrieval, text-matching, clustering, and classification. For multimodal inputs via the API, pass image URLs, audio file URLs, video file URLs, or PDF URLs directly; the model routes each modality through the appropriate encoder. Supported audio formats include WAV, MP3, FLAC, OGG, M4A, and Opus. Video inputs are processed as 32 uniformly sampled frames. Modalities can be mixed freely within a single batch, since the embedding space is shared across all of them. Use cosine similarity for comparison. Matryoshka truncation from 1024 down to 32 dimensions is supported. Text-only embeddings are drop-in compatible with jina-embeddings-v5-text-small, so no reindexing is needed when upgrading.
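As a concrete illustration of mixing modalities in one batch, here is a hypothetical request body. The per-modality input keys and the `dimensions` truncation parameter are assumptions modeled on the existing Jina API conventions, not a documented schema for this model, and the URLs are placeholders:

```python
import json

# One batch mixing all five supported input modalities -- field names are
# illustrative assumptions, not a confirmed schema.
request = {
    "model": "jina-embeddings-v5-omni-small",
    "task": "retrieval.passage",
    "dimensions": 256,  # assumed Matryoshka-truncation parameter
    "input": [
        {"text": "Quarterly earnings summary"},
        {"image": "https://example.com/chart.png"},
        {"audio": "https://example.com/earnings-call.mp3"},
        {"video": "https://example.com/keynote.mp4"},
        {"pdf": "https://example.com/10-k.pdf"},
    ],
}

body = json.dumps(request)
```

All five returned vectors would live in the same 256-dimensional truncated space, so any of them can be compared against any other with cosine similarity.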
Blogs that mention this model
May 12, 2026 • 7 minutes read
jina-embeddings-v5-omni: Embeddings for Text, Image, Audio and Video
One model, four modalities: text, image, audio, video. Best-in-class omni embeddings in 1.6B and 0.9B.
Han Xiao
Jina AI by Elastic © 2020-2026.