Bootstrapping Audio Embeddings from Multimodal LLMs

Google recently released Gemini Embedding 2, their first natively multimodal embedding model. Text, images, video, audio, documents, all mapped into a single 3072-dimensional vector space. This is part of a broader trend toward omni embedding models: unified models that handle all modalities in one architecture, from jina-embeddings-v4 to Omni-Embed-Nemotron to Omni-5.

Han Xiao presenting at EMNLP 2025 BoF session — At last year's EMNLP BoF session, I presented omni models as one of the key directions for dense retrieval in 2026.

What caught our eye was audio. Most people hear "multimodal embedding" and think images, maybe video. Audio is the forgotten modality: harder to collect, harder to label, and fewer people working on it. At Jina AI we had explored exactly this problem, building small (<1.2B parameter) audio embedding models as part of our work toward omni embeddings. Gemini Embedding 2's release is a good moment to share what we learned along the way.

Audio embeddings

An audio embedding is a fixed-length vector representation of an audio clip. Given a raw waveform, the model produces a dense vector (typically 768 to 3072 dimensions) that captures the semantic content of the sound. Two clips with similar meaning produce similar embeddings, and an audio clip sits close to its text description in the shared embedding space. This is one piece of the omni embedding puzzle: once you can embed audio alongside text and images in the same vector space, you unlock cross-modal retrieval across all modalities.

The dominant approach since 2022 is Contrastive Language-Audio Pretraining (CLAP), extending CLIP to audio. LAION-CLAP scaled this with 630K pairs and feature fusion. The strongest variant (Elizalde et al., 2023) trained on 4.6M pairs using an audio encoder across 22 diverse audio tasks paired with an autoregressive decoder, achieving 42.0 cvR@5 on AudioCaps at 250M parameters.

We take a different route: turn multimodal LLMs that already understand audio into embedding models.

Architecture

First, what goes in and what comes out. The input is raw audio: a waveform decoded from any standard format (WAV, MP3, FLAC) and resampled to 16kHz mono. The audio encoder converts this waveform into a 128-bin log-mel spectrogram, then processes it into a sequence of feature tokens at roughly 150 tokens per second. A 10-second clip becomes about 1,500 tokens. Maximum input length is 30 seconds; longer audio needs to be chunked. The output is a single dense vector (the embedding), typically 896 to 3584 dimensions depending on the LLM backbone size.

We start from Qwen2.5-Omni, a multimodal LLM with native audio understanding. Three components: an audio encoder (~0.6-0.8B params) that converts waveforms to feature vectors via a ~4.5M linear projection, an LLM backbone (0.5-7B params) that processes both audio features and text tokens with each transformer layer adding ~0.2B parameters, and a pooling layer that mean-pools the last hidden state into a single embedding vector. Both modalities share the same LLM backbone, so they are already roughly aligned from pretraining.

Architecture comparison — Left: standard approach, finetuning the full MLLM end-to-end (3-7B parameters). Right: module combination, pairing a pretrained audio encoder with a smaller LLM backbone. The shared backbone processes both audio features and text tokens, producing embeddings in the same vector space.

Training objective is InfoNCE contrastive loss. Each modality is encoded independently, loss is computed in both directions and averaged:

def training_step(audio_batch, text_batch):
    audio_embeds = model.encode_audio(audio_batch)  # [B, D]
    text_embeds = model.encode_text(text_batch)      # [B, D]
    
    audio_embeds = F.normalize(audio_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    
    sim = audio_embeds @ text_embeds.T / temperature  # [B, B]
    labels = torch.arange(len(sim), device=sim.device)
    loss = (F.cross_entropy(sim, labels) + 
            F.cross_entropy(sim.T, labels)) / 2
    return loss

Training data

Five audio-text pair datasets, 181K samples total:

Dataset	Samples	Description
AudioSetStrong	108K	Temporally-labeled events, GPT-generated captions (subset of AudioSet)
FSD50K	41K	Human-labeled sound events, 200 classes
Clotho	19K	Audio captioning, detailed descriptions
UrbanSound8K	9K	Urban sound classification
MACS	4K	Urban acoustic scenes

CLAP used the full AudioSet (2M+ audios) plus other sources totaling 4.6M pairs. We use only AudioSetStrong (~100K). Starting from a pretrained MLLM drastically reduces the data needed.

def load_sample(audio_path, caption):
    waveform, sr = torchaudio.load(audio_path)
    waveform = torchaudio.transforms.Resample(sr, 16000)(waveform)
    audio_inputs = processor.feature_extractor(
        waveform, sampling_rate=16000, return_tensors="pt"
    )
    text_inputs = processor.tokenizer(caption, padding=True, return_tensors="pt")
    return audio_inputs, text_inputs

Four approaches

Goal: audio embedding model under 1.2B that beats CLAP.

Full model finetuning. Qwen2.5-Omni-7B on audio-text pairs: AudioCaps T2A cvR@5 = 63.2, Clotho T2A = 39.2. This is the upper bound, but 7B is not deployable. Tevatron 2.0 similarly finetuned on AudioCaps alone (61.2 but only 11.9 on Clotho, showing poor generalization from single-dataset training). ColQwen-Omni finetuned on visual-document tasks with zero audio data, achieving 37.4 through cross-modal transfer.

Layer pruning. Remove transformer layers from the 7B model. Each layer is ~0.2B, so a 10-layer model has ~3.5B total.

Layer pruning effect — Performance vs model size as transformer layers are removed. AudioCaps (red) degrades from 63.2 to 56.0 cvR@5 going from 20 to 5 layers. All configurations still beat the CLAP baseline (dashed). Even at 5 layers (2.3B params), the model cannot reach the 1B target.

Layers	Params	AudioCaps T2A cvR@5	Clotho T2A cvR@5
20	5.8B	63.2	39.2
10	3.5B	58.2	36.5
5	2.3B	56.0	36.0

Batch size (32, 64, 128) made no meaningful difference. Larger batches help initially but can degrade later: batch 128 hit 31.3 NDCG at 2K steps on Clotho but dropped to 29.3 at 10K steps.

Text-only modality transfer. Finetune only on text pairs (MultiNLI, SNLI, FEVER, SciFact), relying on pretrained cross-modal alignment. Worked on the full 7B (AudioCaps 46.1, beating CLAP's 42.0) but completely failed on the pruned 10-layer model (cvR@5 = 5.9). The cross-modal wiring is distributed across the full network and does not survive pruning.

Module combination. The breakthrough: take an audio encoder from one model and a small LLM from another, even across model families. Qwen2.5-Omni trains in three stages: (1) audio/vision encoders with frozen LLM, (2) all parameters unfrozen, (3) 32K context. We combine modules from different stages:

Config	Audio Encoder	LLM	Params
M1	Qwen3-Omni (0.6B, pre-stage-1)	Qwen2.5-0.5B	1.1B
M2	Qwen3-Omni (0.6B, pre-stage-1)	Qwen2.5-3B	3.6B
M3	Qwen2.5-Omni-3B (0.8B, post-stage-3)	Qwen2.5-3B	3.8B
M4	Qwen2.5-Omni-3B (full)	Full 3B	3.8B

Implementation detail: Qwen3-Omni uses Qwen3OmniMoePreTrainedModel while standalone Qwen3 uses Qwen3ForCausalLM. We initialize an Omni model shell with matching dimensions and copy weights into corresponding locations.

Module combination results — AudioCaps and Clotho T2A cvR@5 for each configuration. M1 at 1.1B parameters reaches 49.7 on AudioCaps, beating CLAP (42.0) by 18%. Using a better-aligned audio encoder from post-stage-3 training improves results (M3 vs M2: +4.2 on AudioCaps). Stages 2-3 of LLM pretraining are not critical for embedding quality (M3 vs M4 is within noise).

Evaluation

Evaluating audio embeddings is fundamentally about retrieval quality: given a text query, can the model find the right audio clip? The key challenge is that "right" depends on the dataset. AudioCaps has concrete descriptions ("a man speaking followed by a door closing"), while Clotho has abstract captions ("a quiet atmosphere with distant rumbling"). A model that memorizes surface-level audio features will do well on AudioCaps but struggle on Clotho. We care most about generalization across description styles.

CV-Recall@5 (cvR@5): for each text query, check if any correct audio clip appears in the top-5 results. Binary score averaged over all queries. Standard metric in MTEB audio retrieval.

def evaluate_cvr_at_k(model, dataset, k=5):
    audio_embeds = model.encode_audio(dataset.audio_clips)
    text_embeds = model.encode_text(dataset.text_queries)
    sim = F.normalize(audio_embeds) @ F.normalize(text_embeds).T
    
    hits = 0
    for i in range(len(dataset.text_queries)):
        top_k = sim[:, i].argsort(descending=True)[:k]
        if dataset.ground_truth[i] in top_k:
            hits += 1
    return hits / len(dataset.text_queries)

Three evaluation datasets from MTEB: AudioCaps (video-derived, human captions), AudioSetStrong (temporally-labeled, GPT descriptions), Clotho (diverse, abstract captions). CLAP used the full AudioSet (2M+) while we used AudioSetStrong (~100K), which partly explains CLAP's edge on that benchmark.

Full results comparison — Horizontal bar chart comparing all model configurations across AudioCaps, AudioSetStrong, and Clotho T2A retrieval. Red dashed line marks the CLAP baseline. Module combination models (green) achieve strong results at much smaller sizes. The full 7B finetuned model (dark blue) sets the upper bound.

Applications

Audio embeddings are becoming relevant beyond traditional retrieval. In agentic systems, audio embeddings enable intent routing: an agent that receives voice input can embed the audio and route it to the right tool or sub-agent based on semantic similarity, without waiting for full transcription. Sound event classification powers real-time monitoring in industrial settings, smart home automation, and security systems. In multimodal agent workflows, audio embeddings let agents search, compare, and reason over audio content the same way they already handle text and images. Music and media applications use them for similarity search, copyright detection, and content recommendation. As voice interfaces become the default interaction mode for AI agents, compact audio embeddings that run on-device become critical for low-latency, privacy-preserving applications.

Conclusions

Starting from a pretrained MLLM is the single biggest lever. It provides cross-modal alignment, a strong text encoder, and a capable audio encoder in one package. Module combination is the most promising direction: mixing audio encoders and LLMs from different models and training stages opens a design space barely explored. Our models dominate AudioCaps but only match CLAP on Clotho, whose abstract descriptions expose weaknesses that AudioCaps misses. Cross-modal transfer does not survive model compression.

This work is one step toward the omni embedding model: a single model that embeds text, images, audio, video, and documents into a unified retrieval space. The module combination approach shows that you can bootstrap new modalities efficiently by reusing pretrained components. Next steps include MoE architectures with <500M activation parameters, combining module combination with modality transfer, and scaling data with WavCaps, MusicCaps, and speech datasets.