Tech blog
March 11, 2026

# Bootstrapping Audio Embeddings from Multimodal LLMs

Turn any multimodal LLM into a small audio embedding model that beats CLAP with 25x less data.
Han Xiao • 7 minutes read

GitHub: jina-ai/audio-embedding-kickstarter

Google recently released Gemini Embedding 2, their first natively multimodal embedding model. Text, images, video, audio, documents, all mapped into a single 3072-dimensional vector space. This is part of a broader trend toward omni embedding models: unified models that handle all modalities in one architecture, from jina-embeddings-v4 to Omni-Embed-Nemotron to Omni-5.

Han Xiao presenting at EMNLP 2025 BoF session
At last year's EMNLP BoF session, I presented omni models as one of the key directions for dense retrieval in 2026.

What caught our eye was audio. Most people hear "multimodal embedding" and think images, maybe video. Audio is the forgotten modality: harder to collect, harder to label, and fewer people working on it. At Jina AI we had explored exactly this problem, building small (<1.2B parameter) audio embedding models as part of our work toward omni embeddings. Gemini Embedding 2's release is a good moment to share what we learned along the way.

## Audio embeddings

An audio embedding is a fixed-length vector representation of an audio clip. Given a raw waveform, the model produces a dense vector (typically 768 to 3072 dimensions) that captures the semantic content of the sound. Two clips with similar meaning produce similar embeddings, and an audio clip sits close to its text description in the shared embedding space. This is one piece of the omni embedding puzzle: once you can embed audio alongside text and images in the same vector space, you unlock cross-modal retrieval across all modalities.
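Concretely, cross-modal retrieval in a shared space reduces to cosine similarity between vectors. A minimal NumPy sketch (the toy 4-dimensional embeddings below are illustrative, not model outputs):

```python
import numpy as np

def rank_audio_for_query(text_embed, audio_embeds):
    """Rank audio clips by cosine similarity to a text query embedding."""
    t = text_embed / np.linalg.norm(text_embed)
    a = audio_embeds / np.linalg.norm(audio_embeds, axis=1, keepdims=True)
    scores = a @ t                # cosine similarity per clip
    return np.argsort(-scores)    # best match first

# toy 4-dim space: clip 1 points almost the same way as the query
query = np.array([1.0, 0.0, 0.0, 0.0])
clips = np.array([
    [0.0, 1.0, 0.0, 0.0],   # unrelated
    [0.9, 0.1, 0.0, 0.0],   # close to the query
    [0.5, 0.5, 0.5, 0.5],   # partially related
])
print(rank_audio_for_query(query, clips))  # -> [1 2 0]
```

The same ranking logic works for any query/document modality pair once both sit in the shared space.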

The dominant approach since 2022 has been Contrastive Language-Audio Pretraining (CLAP), which extends CLIP to audio. LAION-CLAP scaled this recipe with 630K pairs and feature fusion. The strongest variant (Elizalde et al., 2023) trained on 4.6M pairs spanning 22 diverse audio tasks, pairing an audio encoder with an autoregressive decoder, and reaches 42.0 cvR@5 on AudioCaps at 250M parameters.

We take a different route: turn multimodal LLMs that already understand audio into embedding models.

## Architecture

First, what goes in and what comes out. The input is raw audio: a waveform decoded from any standard format (WAV, MP3, FLAC) and resampled to 16kHz mono. The audio encoder converts this waveform into a 128-bin log-mel spectrogram, then processes it into a sequence of feature tokens at roughly 150 tokens per second. A 10-second clip becomes about 1,500 tokens. Maximum input length is 30 seconds; longer audio needs to be chunked. The output is a single dense vector (the embedding), typically 896 to 3584 dimensions depending on the LLM backbone size.
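The arithmetic above is easy to sketch. Assuming the stated ~150 tokens/second rate and the 30-second limit, hypothetical helpers for token budgeting and chunking might look like:

```python
TOKENS_PER_SECOND = 150   # approximate encoder output rate
MAX_SECONDS = 30          # model's maximum audio context
SAMPLE_RATE = 16_000      # model expects 16 kHz mono input

def num_feature_tokens(duration_s):
    """Estimated feature tokens the encoder emits for a clip."""
    return int(duration_s * TOKENS_PER_SECOND)

def chunk_samples(n_samples, sample_rate=SAMPLE_RATE, max_s=MAX_SECONDS):
    """Split a long waveform (by sample index) into <=30s chunks."""
    chunk = max_s * sample_rate
    return [(start, min(start + chunk, n_samples))
            for start in range(0, n_samples, chunk)]

print(num_feature_tokens(10))           # -> 1500 tokens for a 10s clip
print(chunk_samples(75 * SAMPLE_RATE))  # a 75s clip -> three chunks
```

Chunked clips would each get their own embedding; how to aggregate chunk embeddings (e.g. averaging) is left open here.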

We start from Qwen2.5-Omni, a multimodal LLM with native audio understanding. It has three components: an audio encoder (~0.6-0.8B params) that converts waveforms into feature vectors, followed by a ~4.5M-parameter linear projection into the LLM's token space; an LLM backbone (0.5-7B params) that processes both audio features and text tokens, with each transformer layer adding ~0.2B parameters; and a pooling layer that mean-pools the last hidden state into a single embedding vector. Both modalities share the same LLM backbone, so they are already roughly aligned from pretraining.
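The pooling step is the simplest piece. A framework-free sketch of masked mean pooling (the attention-mask handling is our assumption; the text only specifies mean pooling of the last hidden state):

```python
import numpy as np

def mean_pool(last_hidden, attention_mask):
    """Mean-pool a [T, D] hidden-state sequence into one [D] embedding,
    counting only non-padding positions (mask == 1)."""
    mask = attention_mask[:, None].astype(last_hidden.dtype)  # [T, 1]
    return (last_hidden * mask).sum(axis=0) / mask.sum()

hidden = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])  # last row is padding
mask = np.array([1, 1, 0])
print(mean_pool(hidden, mask))  # -> [2. 3.]
```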

Architecture comparison
Left: standard approach, finetuning the full MLLM end-to-end (3-7B parameters). Right: module combination, pairing a pretrained audio encoder with a smaller LLM backbone. The shared backbone processes both audio features and text tokens, producing embeddings in the same vector space.

The training objective is the InfoNCE contrastive loss. Each modality is encoded independently; the loss is computed in both directions and averaged:

```python
import torch
import torch.nn.functional as F

def training_step(audio_batch, text_batch):
    audio_embeds = model.encode_audio(audio_batch)  # [B, D]
    text_embeds = model.encode_text(text_batch)     # [B, D]

    audio_embeds = F.normalize(audio_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # symmetric InfoNCE: audio->text and text->audio, averaged
    sim = audio_embeds @ text_embeds.T / temperature  # [B, B]
    labels = torch.arange(len(sim), device=sim.device)
    loss = (F.cross_entropy(sim, labels) +
            F.cross_entropy(sim.T, labels)) / 2
    return loss
```

## Training data

Five audio-text pair datasets, 181K samples total:

| Dataset | Samples | Description |
|---|---|---|
| AudioSetStrong | 108K | Temporally-labeled events, GPT-generated captions (subset of AudioSet) |
| FSD50K | 41K | Human-labeled sound events, 200 classes |
| Clotho | 19K | Audio captioning, detailed descriptions |
| UrbanSound8K | 9K | Urban sound classification |
| MACS | 4K | Urban acoustic scenes |

CLAP used the full AudioSet (2M+ audios) plus other sources, 4.6M pairs in total. We use only the AudioSetStrong subset (~100K) plus four smaller datasets. Starting from a pretrained MLLM drastically reduces the data needed.

```python
import torchaudio

def load_sample(audio_path, caption):
    waveform, sr = torchaudio.load(audio_path)  # [channels, samples]
    waveform = waveform.mean(dim=0)             # downmix to mono
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    audio_inputs = processor.feature_extractor(
        waveform, sampling_rate=16000, return_tensors="pt"
    )
    text_inputs = processor.tokenizer(caption, padding=True, return_tensors="pt")
    return audio_inputs, text_inputs
```

## Four approaches

Goal: an audio embedding model under 1.2B parameters that beats CLAP.

Full model finetuning. Finetuning Qwen2.5-Omni-7B on audio-text pairs gives AudioCaps T2A cvR@5 = 63.2 and Clotho T2A = 39.2. This is the upper bound, but 7B is not deployable. Tevatron 2.0 similarly finetuned on AudioCaps alone (61.2 on AudioCaps but only 11.9 on Clotho, showing poor generalization from single-dataset training). ColQwen-Omni finetuned on visual-document tasks with zero audio data and still reached 37.4 through cross-modal transfer.

Layer pruning. Remove transformer layers from the 7B model. Each layer is ~0.2B, so a 10-layer model has ~3.5B total.

Layer pruning effect
Performance vs model size as transformer layers are removed. AudioCaps (red) degrades from 63.2 to 56.0 cvR@5 going from 20 to 5 layers. All configurations still beat the CLAP baseline (dashed). Even at 5 layers (2.3B params), the model cannot reach the 1B target.
| Layers | Params | AudioCaps T2A cvR@5 | Clotho T2A cvR@5 |
|---|---|---|---|
| 20 | 5.8B | 63.2 | 39.2 |
| 10 | 3.5B | 58.2 | 36.5 |
| 5 | 2.3B | 56.0 | 36.0 |
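Mechanically, depth pruning is just truncating the decoder stack. A toy sketch, assuming the simplest strategy of keeping the first k layers (with a Hugging Face model this would be roughly slicing `model.model.layers`, an assumed layout):

```python
def prune_to_first_k(model_layers, k):
    """Depth-prune: keep the first k transformer layers, drop the rest."""
    return model_layers[:k]

layers = [f"layer_{i}" for i in range(20)]  # stand-in for 20 decoder layers
pruned = prune_to_first_k(layers, 10)
print(len(pruned))  # -> 10
```

Which subset of layers to keep is itself a design choice; the table above only fixes the layer count, not the selection strategy.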

Batch size (32, 64, 128) made no meaningful difference. Larger batches help initially but can degrade later: batch 128 hit 31.3 NDCG on Clotho at 2K steps but dropped to 29.3 by 10K steps.

Text-only modality transfer. Finetune only on text pairs (MultiNLI, SNLI, FEVER, SciFact), relying on pretrained cross-modal alignment. Worked on the full 7B (AudioCaps 46.1, beating CLAP's 42.0) but completely failed on the pruned 10-layer model (cvR@5 = 5.9). The cross-modal wiring is distributed across the full network and does not survive pruning.

Module combination. The breakthrough: take an audio encoder from one model and a small LLM from another, even across model families. Qwen2.5-Omni trains in three stages: (1) audio/vision encoders with frozen LLM, (2) all parameters unfrozen, (3) 32K context. We combine modules from different stages:

| Config | Audio Encoder | LLM | Params |
|---|---|---|---|
| M1 | Qwen3-Omni (0.6B, pre-stage-1) | Qwen2.5-0.5B | 1.1B |
| M2 | Qwen3-Omni (0.6B, pre-stage-1) | Qwen2.5-3B | 3.6B |
| M3 | Qwen2.5-Omni-3B (0.8B, post-stage-3) | Qwen2.5-3B | 3.8B |
| M4 | Qwen2.5-Omni-3B (full) | Full 3B | 3.8B |

Implementation detail: Qwen3-Omni uses the Qwen3OmniMoePreTrainedModel class, while standalone Qwen3 uses Qwen3ForCausalLM. We initialize an Omni model shell with matching dimensions and copy weights into the corresponding locations.
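The weight-copy step can be sketched as key-wise state-dict surgery. The key names below are hypothetical; only the shape check and optional renaming reflect the procedure described above:

```python
import numpy as np

def copy_matching_weights(src_state, dst_state, key_map=None):
    """Copy weights from a source state dict into a destination shell,
    optionally renaming keys (prefixes differ across model families).
    Shapes must match; unmatched destination keys keep their init."""
    key_map = key_map or {}
    copied = []
    for src_key, tensor in src_state.items():
        dst_key = key_map.get(src_key, src_key)
        if dst_key in dst_state and dst_state[dst_key].shape == tensor.shape:
            dst_state[dst_key] = tensor
            copied.append(dst_key)
    return copied

# hypothetical key names, stand-in arrays instead of real tensors
src = {"audio_tower.proj.weight": np.ones((4, 4))}
dst = {"thinker.audio_tower.proj.weight": np.zeros((4, 4))}
copied = copy_matching_weights(
    src, dst,
    key_map={"audio_tower.proj.weight": "thinker.audio_tower.proj.weight"})
print(copied)  # -> ['thinker.audio_tower.proj.weight']
```

With real models the same pattern runs over `state_dict()` tensors and finishes with `load_state_dict`.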

Module combination results
AudioCaps and Clotho T2A cvR@5 for each configuration. M1 at 1.1B parameters reaches 49.7 on AudioCaps, beating CLAP (42.0) by 18%. Using a better-aligned audio encoder from post-stage-3 training improves results (M3 vs M2: +4.2 on AudioCaps). Stages 2-3 of LLM pretraining are not critical for embedding quality (M3 vs M4 is within noise).

## Evaluation

Evaluating audio embeddings is fundamentally about retrieval quality: given a text query, can the model find the right audio clip? The key challenge is that "right" depends on the dataset. AudioCaps has concrete descriptions ("a man speaking followed by a door closing"), while Clotho has abstract captions ("a quiet atmosphere with distant rumbling"). A model that memorizes surface-level audio features will do well on AudioCaps but struggle on Clotho. We care most about generalization across description styles.

CV-Recall@5 (cvR@5): for each text query, check if any correct audio clip appears in the top-5 results. Binary score averaged over all queries. Standard metric in MTEB audio retrieval.

```python
import torch.nn.functional as F

def evaluate_cvr_at_k(model, dataset, k=5):
    audio_embeds = model.encode_audio(dataset.audio_clips)  # [N_audio, D]
    text_embeds = model.encode_text(dataset.text_queries)   # [N_query, D]
    sim = F.normalize(audio_embeds) @ F.normalize(text_embeds).T

    hits = 0
    for i in range(len(dataset.text_queries)):
        # top-k audio clips for query i; hit if the gold clip is among them
        top_k = sim[:, i].argsort(descending=True)[:k]
        if dataset.ground_truth[i] in top_k:
            hits += 1
    return hits / len(dataset.text_queries)
```

Three evaluation datasets from MTEB: AudioCaps (video-derived, human captions), AudioSetStrong (temporally-labeled, GPT descriptions), Clotho (diverse, abstract captions). CLAP used the full AudioSet (2M+) while we used AudioSetStrong (~100K), which partly explains CLAP's edge on that benchmark.

Full results comparison
Horizontal bar chart comparing all model configurations across AudioCaps, AudioSetStrong, and Clotho T2A retrieval. Red dashed line marks the CLAP baseline. Module combination models (green) achieve strong results at much smaller sizes. The full 7B finetuned model (dark blue) sets the upper bound.

## Applications

Audio embeddings are becoming relevant beyond traditional retrieval. In agentic systems, audio embeddings enable intent routing: an agent that receives voice input can embed the audio and route it to the right tool or sub-agent based on semantic similarity, without waiting for full transcription. Sound event classification powers real-time monitoring in industrial settings, smart home automation, and security systems. In multimodal agent workflows, audio embeddings let agents search, compare, and reason over audio content the same way they already handle text and images. Music and media applications use them for similarity search, copyright detection, and content recommendation. As voice interfaces become the default interaction mode for AI agents, compact audio embeddings that run on-device become critical for low-latency, privacy-preserving applications.
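As a toy illustration of intent routing, an agent could embed each tool description once and nearest-neighbor-match incoming voice embeddings against them (the tool names and embeddings below are invented):

```python
import numpy as np

def route_intent(audio_embed, tool_embeds, tool_names):
    """Route a voice command to the tool whose description embedding is
    closest in the shared audio-text space, without transcription."""
    a = audio_embed / np.linalg.norm(audio_embed)
    t = tool_embeds / np.linalg.norm(tool_embeds, axis=1, keepdims=True)
    return tool_names[int(np.argmax(t @ a))]

# toy embeddings; in practice both sides come from the embedding model
tools = ["set_timer", "play_music", "check_weather"]
tool_embeds = np.eye(3)
voice = np.array([0.1, 0.9, 0.2])  # closest to "play_music"
print(route_intent(voice, tool_embeds, tools))  # -> play_music
```

Because tool embeddings are precomputed, routing is a single matrix-vector product, which is what makes the low-latency on-device scenario plausible.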

## Conclusions

Starting from a pretrained MLLM is the single biggest lever. It provides cross-modal alignment, a strong text encoder, and a capable audio encoder in one package. Module combination is the most promising direction: mixing audio encoders and LLMs from different models and training stages opens a design space barely explored. Our models dominate AudioCaps but only match CLAP on Clotho, whose abstract descriptions expose weaknesses that AudioCaps misses. Cross-modal transfer does not survive model compression.

This work is one step toward the omni embedding model: a single model that embeds text, images, audio, video, and documents into a unified retrieval space. The module combination approach shows that you can bootstrap new modalities efficiently by reusing pretrained components. Next steps include MoE architectures with <500M activation parameters, combining module combination with modality transfer, and scaling data with WavCaps, MusicCaps, and speech datasets.
