Tech blog
May 25, 2025

Fair Scoring for Multimodal Documents with jina-reranker-m0

Text similarity: 0.7. Image similarity: 0.5. Which document is more relevant? You literally cannot tell—and that's the core problem breaking multimodal search. We solve it with unified reranking.
Nan Wang, Alex C-G • 8 minutes read

Imagine you're building a sports news search system. A user searches for "tennis players celebrating championship victory" and you need to find the most relevant articles from your database. Each article contains both a text caption and an image - typical of modern sports coverage.

Your system needs to take a text query and return a ranked list of the most relevant multimodal documents from your corpus. Sounds straightforward, but there's a fundamental problem that breaks all obvious approaches.

Here's what happens when you try to rank these documents. Your embedding model, say jina-clip-v2, produces similarity scores like this:

| Article | Content Type | Description | Similarity Score |
|---------|--------------|-------------|------------------|
| A | Text | Novak Djokovic wins Australian Open final in straight sets | 0.72 |
| A | Image | [photo of player holding trophy and smiling] | 0.31 |
| B | Text | Weather delays affect outdoor tournament scheduling | 0.23 |
| B | Image | [photo of tennis players jumping and celebrating] | 0.54 |

Which article is more relevant? Article A has a high text score but low image score. Article B has a low text score but higher image score. The fundamental challenge is that you cannot compare 0.72 (text) with 0.54 (image) because these similarity scores exist on completely different scales.
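
To make the setup concrete, here is a minimal sketch of how such scores could be obtained, assuming the Jina Embeddings API accepts mixed text and image inputs for jina-clip-v2; the image URL and the commented-out values are placeholders, not the article's actual numbers:

```python
import os
import numpy as np
import requests

EMBEDDINGS_API = "https://api.jina.ai/v1/embeddings"
HEADERS = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}

def embed(inputs):
    # inputs: list of {"text": ...} or {"image": <url>} objects
    resp = requests.post(EMBEDDINGS_API, headers=HEADERS,
                         json={"model": "jina-clip-v2", "input": inputs})
    resp.raise_for_status()
    return [np.array(d["embedding"]) for d in resp.json()["data"]]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec, caption_vec, image_vec = embed([
    {"text": "tennis players celebrating championship victory"},            # the query
    {"text": "Novak Djokovic wins Australian Open final in straight sets"},  # article A caption
    {"image": "https://example.com/article-a-photo.jpg"},                    # placeholder URL
])

print("query-to-text: ", cos(query_vec, caption_vec))  # lands somewhere in the wide text score range
print("query-to-image:", cos(query_vec, image_vec))    # lands on a different, narrower scale
```

Both numbers come from the same model, but as the next section shows, they don't live on a comparable scale.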

When Trivial Solutions Fail

The What and Why of Text-Image Modality Gap in CLIP Models
You can’t just use a CLIP model to retrieve text and images and sort the results by score. Why? Because of the modality gap. What is it, and where does it come from?
Jina AI · Bo Wang, Scott Martens

Because of the modality gap in jina-clip-v2, as in almost every other CLIP-like model, none of the obvious approaches work. If you simply take the higher score, you run into the fact that text scores cluster around 0.2-0.8 while image scores cluster around 0.4-0.6. This means a mediocre text match (0.6) will always beat an excellent image match (0.5).

Averaging the scores doesn't help either. Computing (0.7 + 0.3)/2 = 0.5 gives you a number, but what does it actually mean? You're averaging fundamentally meaningless quantities. Similarly, any fixed weighting scheme is arbitrary - sometimes text matters more, sometimes images do, and this depends entirely on the specific query and document.

Even normalizing the scores first doesn't solve the core issue. You're still trying to combine fundamentally different similarity measures that capture different aspects of relevance.
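
The failure modes are easy to see with toy numbers. The scores below are made up to mirror the clustering described above and are illustrative only:

```python
import numpy as np

# Made-up scores for four documents (A-D) against one query, mirroring the
# clustering described above: text spread wide, image squeezed into a narrow band.
text_scores  = np.array([0.72, 0.23, 0.61, 0.35])
image_scores = np.array([0.31, 0.54, 0.42, 0.58])

# "Take the higher score": the wide text scale dominates, so a mediocre text
# match (0.61) outranks an excellent image match (0.58).
print(np.maximum(text_scores, image_scores))

# "Average them": produces a number, but it mixes two incompatible scales.
print((text_scores + image_scores) / 2)

# "Normalize first": min-max rescaling only fixes the ranking *within* each
# modality; it still can't say whether A's text match beats B's image match.
norm = lambda s: (s - s.min()) / (s.max() - s.min())
print(norm(text_scores), norm(image_scores))
```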

What Actually Happens

EDIS: Entity-Driven Image Search over Multimodal Web Content
Making image retrieval methods practical for real-world search applications requires significant progress in dataset scales, entity comprehension, and multimodal information fusion. In this work, we introduce Entity-Driven Image Search (EDIS), a challenging dataset for cross-modal image search in the news domain. EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description. Unlike datasets that assume a small set of single-modality candidates, EDIS reflects real-world web image search scenarios by including a million multimodal image-text pairs as candidates. EDIS encourages the development of retrieval models that simultaneously address cross-modal information fusion and matching. To achieve accurate ranking results, a model must: 1) understand named entities and events from text queries, 2) ground entities onto images or text descriptions, and 3) effectively fuse textual and visual representations. Our experimental results show that EDIS challenges state-of-the-art methods with dense entities and a large-scale candidate set. The ablation study also proves that fusing textual features with visual features is critical in improving retrieval results.
arXiv.org · Siqi Liu

To get a better idea of what we're working with, here's an example document from the EDIS dataset, showing the image (a German football match) and the caption ("One More Field Where the Content Trails Germany").

Figure 1: Example multimodal document containing both image and text content. Since we have two modalities, for any given query there are now two semantic gaps (between the query and the text, and the query and the image). To get the best results, should we search the text content of the documents, or the image content?

Overall, jina-clip-v2 shows much higher similarities when comparing query-to-text than query-to-image in the EDIS dataset, in part because of the way the model was trained and in part due to the dataset itself:

Figure 2: Similarity scores between query-to-image (in red) and query-to-text (in blue) using jina-clip-v2.
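
A comparison like Figure 2 could be reproduced with a sketch along these lines; plot_similarity_gap is a hypothetical helper, and the two score lists would come from running jina-clip-v2 over the EDIS queries and documents (for example with the embed()/cos() helpers sketched earlier):

```python
import matplotlib.pyplot as plt

def plot_similarity_gap(q2t_scores, q2i_scores):
    """Overlay histograms of query-to-text vs. query-to-image cosine similarities,
    the kind of comparison shown in Figure 2. Both arguments are lists of floats."""
    plt.hist(q2t_scores, bins=50, alpha=0.6, label="query-to-text")
    plt.hist(q2i_scores, bins=50, alpha=0.6, label="query-to-image")
    plt.xlabel("jina-clip-v2 cosine similarity")
    plt.ylabel("number of query-document pairs")
    plt.legend()
    plt.show()
```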

Therefore, it seems logical to retrieve a document based on its text rather than its image. And, as we can see in the graphic below, we get much better results comparing the text query "... for undocumented immigrants helping to establish legal status in the United States" to the text contents of the corpus. In fact, searching by image fails to retrieve the ground truth document (highlighted in yellow) at all:

Figure 3: Example where ground-truth document (highlighted with yellow border) can be retrieved only via jina-clip-v2's query-to-text retrieval when using top_k of 3.

But don’t be fooled. Despite query-to-text showing higher similarity scores, query-to-text and query-to-image similarity scores are not comparable. We can see this by looking at recall@10 when we use jina-clip-v2 to retrieve 32 documents from the EDIS dataset. Clearly, recall is higher with query-to-image:

Table 1: Recall@10 on EDIS when retrieving 32 documents per query with jina-clip-v2.

| Retrieval method | Recall@10 |
|------------------|-----------|
| Query-to-text    | 14.55     |
| Query-to-image   | 22.38     |
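
For reference, recall@10 here is the percentage of queries whose ground-truth document appears among the top ten retrieved candidates. A minimal sketch of that metric, with hypothetical ranked_ids and ground_truth mappings standing in for real retrieval output:

```python
def recall_at_k(ranked_ids, ground_truth, k=10):
    """ranked_ids: {query_id: [doc ids ordered by similarity]},
    ground_truth: {query_id: relevant doc id}. Returns recall@k as a percentage."""
    hits = sum(1 for qid, ranked in ranked_ids.items()
               if ground_truth[qid] in ranked[:k])
    return 100.0 * hits / len(ranked_ids)

# Hypothetical usage: one ranking per query, built by sorting the corpus by
# query-to-text (or query-to-image) jina-clip-v2 similarity.
# print(recall_at_k(text_rankings, gold), recall_at_k(image_rankings, gold))
```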

We can see this below: if we use a query from the dataset, "Ear ear An elephant is decorated with Bhartiya Janta Party symbols near the BJP headquarters in New Delhi.", we can retrieve the ground truth document only by its image content. Searching by its text content doesn't return any matches:

Figure 4: Example where ground-truth document (highlighted with yellow border) can be retrieved only via jina-clip-v2's query-to-image retrieval when using top_k of 3.

So, if similarity scores imply we should retrieve documents from their text, and recall implies we should retrieve them from their images, which should we choose? Certainly, Figures 3 and 4 suggest no outright winner. Which modality really presents the closest match between our query and the document we’re looking for? And if we want to merge candidates from both query-to-text and query-to-image retrieval, how can we meaningfully select the top matches if we can’t even compare scores? Clearly just using jina-clip-v2 won’t cut it. We need to throw another model into the mix.

A Simple Two-Stage Pipeline

In April 2025 we released jina-reranker-m0, a multilingual multimodal reranker for retrieving visual documents. We can see its narrower modality gap below, where jina-reranker-m0 shows comparable query-to-text and query-to-image similarity scores, contrasted with the much wider gap shown by jina-clip-v2:

Figure 6: Compared to jina-clip-v2, jina-reranker-m0 shows much less difference between query-to-image (red) and query-to-text (blue) similarity scores.

With this in mind, we can use jina-reranker-m0 for a second pass in the retrieval chain, after initial results are retrieved from jina-clip-v2:

Stage 1: Retrieve candidates from both modalities

  • Use jina-clip-v2 to get 16 documents via text search + 16 via image search
  • Accept that we can't compare scores yet

Stage 2: Unified reranking

  • Feed each (query + full document) pair into jina-reranker-m0
  • The reranker processes both text AND image together
  • Output: Single relevance score on a unified scale
Figure 5: Indexing multimodal documents and a two-stage multimodal retrieval process with jina-clip-v2 and jina-reranker-m0.
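
Here's a rough sketch of what this pipeline could look like in code. It assumes the Jina Rerank API endpoint with jina-reranker-m0, where each document object can carry both a "text" and an "image" field so the reranker scores them jointly, and that results come back with index and relevance_score fields; the corpus layout (precomputed jina-clip-v2 vectors plus caption and image URL) is likewise a placeholder for this illustration:

```python
import os
import numpy as np
import requests

HEADERS = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}

def rerank_m0(query, documents, top_n=10):
    # Assumed API shape: jina-reranker-m0 via the Jina Rerank endpoint, with
    # documents given as objects that may contain both "text" and "image".
    resp = requests.post(
        "https://api.jina.ai/v1/rerank",
        headers=HEADERS,
        json={"model": "jina-reranker-m0", "query": query,
              "documents": documents, "top_n": top_n},
    )
    resp.raise_for_status()
    return resp.json()["results"]

def two_stage_search(query_text, query_vec, corpus, k=16, top_n=10):
    """corpus: list of dicts with precomputed jina-clip-v2 'text_vec' and 'image_vec'
    plus the raw 'caption' and 'image_url' (placeholder field names for this sketch)."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Stage 1: top-k candidates per modality; the two score lists are NOT comparable yet.
    by_text  = sorted(corpus, key=lambda d: cos(query_vec, d["text_vec"]),  reverse=True)[:k]
    by_image = sorted(corpus, key=lambda d: cos(query_vec, d["image_vec"]), reverse=True)[:k]
    candidates = list({id(d): d for d in by_text + by_image}.values())  # de-duplicate

    # Stage 2: unified reranking of each (query, caption + image) pair; every candidate
    # comes back with a single relevance score on one scale.
    docs = [{"text": d["caption"], "image": d["image_url"]} for d in candidates]
    results = rerank_m0(query_text, docs, top_n=top_n)
    return [(candidates[r["index"]], r["relevance_score"]) for r in results]
```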

We expanded the experiments from Table 1, now using jina-clip-v2 to retrieve documents from the corpus, then jina-reranker-m0 to rerank them:

  1. Retrieve 32 documents via query-to-text, then rerank based on query-to-text score.
  2. Retrieve 32 documents via query-to-image, then rerank based on query-to-image score.
  3. Retrieve 16 documents via query-to-text and 16 via query-to-image. Rerank each document based on its query-to-text or query-to-image score, depending on which modality retrieved it.
  4. Retrieve 16 documents via query-to-text and 16 via query-to-image. Rerank based on each document’s averaged query-to-text and query-to-image scores, giving a final score of (query-to-text + query-to-image)/2.
💡
Note that we’re measuring zero-shot performance on EDIS. We didn’t finetune either jina-clip-v2 or jina-reranker-m0 using the dataset.
| Experiment | Description | Recall@10 with jina-clip-v2 | Recall@10 with jina-reranker-m0 |
|------------|-------------|-----------------------------|---------------------------------|
| 1 | 32 docs: query-to-text | 14.55 | 17.42 |
| 2 | 32 docs: query-to-image | 22.38 | 28.94 |
| 3 | 16 docs: query-to-text + 16 docs: query-to-image | 14.55 | 33.81 |
| 4 | 16 docs: query-to-text + 16 docs: query-to-image, combined average reranker scores | 14.55 | 36.24 |
💡
Experiments 1, 3 and 4 all show the same result for recall@10 with jina-clip-v2 due to query-to-text scores being higher than query-to-image scores. Therefore, the top ten results are dominated by the documents retrieved via text.
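
To make experiment 4 concrete: each retrieved document is scored twice by jina-reranker-m0, once against its caption and once against its image, and the two scores are averaged; because both numbers now come from the same reranker, the average is meaningful. A rough sketch, reusing the hypothetical rerank_m0 helper and placeholder field names from the pipeline sketch above:

```python
def fused_score(query_text, doc):
    # Experiment 4: rerank the caption and the image separately with jina-reranker-m0,
    # then average -- both scores already live on the reranker's single unified scale.
    text_score  = rerank_m0(query_text, [{"text": doc["caption"]}],    top_n=1)[0]["relevance_score"]
    image_score = rerank_m0(query_text, [{"image": doc["image_url"]}], top_n=1)[0]["relevance_score"]
    return (text_score + image_score) / 2

# Hypothetical usage over the 16 + 16 candidates from stage 1:
# top10 = sorted(candidates, key=lambda d: fused_score(query, d), reverse=True)[:10]
```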

As we can see, by performing a second pass with jina-reranker-m0, recall goes up across the board, regardless of modality. However, we see the largest increase when we combine both textual and image content from the retrieved documents, hitting a recall@10 of 36.24. A visual example shows that jina-reranker-m0 consistently ranks the ground truth document first, whether searching text or image content:

Figure 7: Sample queries (on the left) and the top_k of 1 result for each reranking methodology (four columns on the right), showing that combining image and text similarity scores consistently ranks the ground truth document first.
💡
While Figures 3 and 4 show a top_k of 3 for the different retrieval methods, for reasons of space Figure 7 shows only top_k of 1 for each query.

Conclusions

This simple two-stage approach delivers a 62% improvement in recall (recall@10 of 36.24 versus 22.38 for the best single-pass retrieval) because the system finally leverages what humans do naturally: considering both what we read and what we see to determine relevance. The lesson extends beyond search: when dealing with multimodal AI systems, single-pass approaches that treat modalities separately will always hit this scoring incompatibility wall. Two-stage architectures that retrieve broadly, then rank intelligently, are becoming essential. Try jina-reranker-m0 via our API or on AWS, GCP, and Azure.
