Imagine you're building a sports news search system. A user searches for "tennis players celebrating championship victory" and you need to find the most relevant articles from your database. Each article contains both a text caption and an image - typical of modern sports coverage.
Your system needs to take a text query and return a ranked list of the most relevant multimodal documents from your corpus. Sounds straightforward, but there's a fundamental problem that breaks all obvious approaches.
Here's what happens when you try to rank these documents. Your embedding model, say jina-clip-v2, produces similarity scores like this:
| Article | Content Type | Description | Similarity Score |
|---|---|---|---|
| A | Text | Novak Djokovic wins Australian Open final in straight sets | 0.72 |
| A | Image | [photo of player holding trophy and smiling] | 0.31 |
| B | Text | Weather delays affect outdoor tournament scheduling | 0.23 |
| B | Image | [photo of tennis players jumping and celebrating] | 0.54 |
Which article is more relevant? Article A has a high text score but low image score. Article B has a low text score but higher image score. The fundamental challenge is that you cannot compare 0.72 (text) with 0.54 (image) because these similarity scores exist on completely different scales.
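For concreteness, here's roughly how scores like these come about: embed the query, the caption, and the image with the same model, then take the cosine similarity between the query vector and each document vector. The sketch below is a minimal illustration against the hosted Jina Embeddings API; the API key handling, the placeholder image URL, and the exact multimodal payload shape are assumptions, so check the current API documentation before relying on it.

```python
import os

import numpy as np
import requests

# Embed the query, the article caption, and the article image with jina-clip-v2.
# NOTE: the image URL is a placeholder and the payload shape is an assumption here.
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-clip-v2",
        "input": [
            {"text": "tennis players celebrating championship victory"},    # the query
            {"text": "Novak Djokovic wins Australian Open final in straight sets"},
            {"image": "https://example.com/trophy-photo.jpg"},               # placeholder
        ],
    },
)
vecs = np.array([item["embedding"] for item in resp.json()["data"]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize for cosine similarity

query_to_text = float(vecs[0] @ vecs[1])   # query-to-text similarity
query_to_image = float(vecs[0] @ vecs[2])  # query-to-image similarity, on a different scale
print(query_to_text, query_to_image)
```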
## When Trivial Solutions Fail

Because of the modality gap in jina-clip-v2, and in almost every other CLIP-like model, none of the obvious approaches works. If you just use the higher score, you run into the fact that text scores cluster around 0.2-0.8 while image scores cluster around 0.4-0.6. This means a mediocre text match (0.6) will always beat an excellent image match (0.5).
Averaging the scores doesn't help either. Computing (0.7 + 0.3)/2 = 0.5 gives you a number, but what does it actually mean? You're averaging quantities that don't share a scale, so the result is meaningless. Similarly, any fixed weighting scheme is arbitrary: sometimes text matters more, sometimes images do, and this depends entirely on the specific query and document.
Even normalizing the scores first doesn't solve the core issue. You're still trying to combine fundamentally different similarity measures that capture different aspects of relevance.
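To make the failure concrete, here's a tiny self-contained sketch using the scores from the table above. Every obvious fusion rule ranks article A first, even though article B's image is arguably the better match for the query, simply because the text and image scores live on different scales:

```python
# Scores from the table above: query-to-text and query-to-image per article.
scores = {
    "A": {"text": 0.72, "image": 0.31},  # strong text match, weak image match
    "B": {"text": 0.23, "image": 0.54},  # weak text match, good image match
}

def rank(strategy):
    """Rank articles by a given fusion strategy (higher fused score is better)."""
    return sorted(scores, key=lambda article: strategy(scores[article]), reverse=True)

# 1. Take the higher score: text scores sit on a wider, higher scale, so text wins by default.
print(rank(lambda s: max(s["text"], s["image"])))            # ['A', 'B']

# 2. Average: collapses two incomparable scales into one number.
print(rank(lambda s: (s["text"] + s["image"]) / 2))          # ['A', 'B']

# 3. Fixed weights: any choice of w is arbitrary and query-dependent.
w = 0.7
print(rank(lambda s: w * s["text"] + (1 - w) * s["image"]))  # ['A', 'B']
```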
## What Actually Happens

To get a better idea of what we're working with, here's an example document from the EDIS dataset, showing the image (a German football match) and the caption ("One More Field Where the Content Trails Germany").

Overall, jina-clip-v2 shows much higher similarities when comparing query-to-text than query-to-image in the EDIS dataset, in part because of the way the model was trained and in part due to the dataset itself:

Therefore, it seems logical to retrieve a document based on its text rather than its image. And, as we can see in the graphic below, we get much better results comparing the text query "... for undocumented immigrants helping to establish legal status in the United States" to the text contents of the corpus. In fact, searching by image fails to retrieve the ground truth document (highlighted in yellow) at all:

Figure 3: Retrieval results for this query with a `top_k` of 3.

But don't be fooled. Despite query-to-text showing higher similarity scores, query-to-text and query-to-image similarity scores are not comparable. We can see this by looking at recall@10 when we use jina-clip-v2 to retrieve 32 documents from the EDIS dataset. Clearly, recall is higher with query-to-image:
| Retrieval method | Recall@10 (%) |
|---|---|
| Query-to-text | 14.55 |
| Query-to-image | 22.38 |

Table 1: Recall@10 for jina-clip-v2 retrieval on the EDIS dataset.
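For reference, recall@10 here is the percentage of queries whose ground-truth document shows up among the top ten results. A minimal sketch of that computation, assuming one relevant document per query (not the exact EDIS evaluation code):

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Percentage of queries whose ground-truth doc ID appears in the top-k results.

    retrieved:    dict mapping query ID -> list of doc IDs, best match first.
    ground_truth: dict mapping query ID -> the single relevant doc ID.
    """
    hits = sum(1 for qid, gt in ground_truth.items() if gt in retrieved.get(qid, [])[:k])
    return 100.0 * hits / len(ground_truth)
```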
We can see this below: if we use a query from the dataset, "Ear ear An elephant is decorated with Bhartiya Janta Party symbols near the BJP headquarters in New Delhi.", we can retrieve the ground truth document only by its image content. Searching by its text content doesn't return any matches:

Figure 4: Retrieval results for this query with a `top_k` of 3.

So, if similarity scores imply we should retrieve documents from their text, and recall implies we should retrieve them from their images, which should we choose? Certainly, Figures 3 and 4 suggest no outright winner. Which modality really presents the closest match between our query and the document we're looking for? And if we want to merge candidates from both query-to-text and query-to-image retrieval, how can we meaningfully select the top matches if we can't even compare scores? Clearly, just using jina-clip-v2 won't cut it. We need to throw another model into the mix.
## A Simple Two-Stage Pipeline
In April 2025 we released jina-reranker-m0, a multilingual multimodal reranker for retrieving visual documents. We can see its narrower modality gap below, where jina-reranker-m0 shows comparable query-to-text and query-to-image similarity scores, contrasted with the much wider gap shown by jina-clip-v2:

With this in mind, we can use jina-reranker-m0 for a second pass in the retrieval chain, after initial results are retrieved from jina-clip-v2:
Stage 1: Retrieve candidates from both modalities
- Use jina-clip-v2 to get 16 documents via text search + 16 via image search
- Accept that we can't compare scores yet
Stage 2: Unified reranking
- Feed each (query + full document) pair into jina-reranker-m0
- The reranker processes both text AND image together
- Output: Single relevance score on a unified scale
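Here's a minimal sketch of that pipeline against the hosted Jina Embeddings and Reranker APIs. The corpus format, the per-query embedding of the whole corpus (in practice you'd precompute document embeddings or use a vector index), and the exact multimodal request payloads are assumptions; consult the current API documentation for the real formats.

```python
import os

import numpy as np
import requests

API = "https://api.jina.ai/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}

def embed(inputs):
    """Embed a list of {'text': ...} or {'image': <url>} items with jina-clip-v2."""
    resp = requests.post(f"{API}/embeddings", headers=HEADERS,
                         json={"model": "jina-clip-v2", "input": inputs})
    vecs = np.array([d["embedding"] for d in resp.json()["data"]])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def top_k(query_vec, doc_vecs, k=16):
    """Indices of the k documents most cosine-similar to the query."""
    return np.argsort(-(doc_vecs @ query_vec))[:k]

def search(query, corpus, k=16, top_n=10):
    """corpus: list of {'text': caption, 'image': image_url} documents (hypothetical format)."""
    q = embed([{"text": query}])[0]
    text_vecs = embed([{"text": doc["text"]} for doc in corpus])
    image_vecs = embed([{"image": doc["image"]} for doc in corpus])

    # Stage 1: candidates from both modalities. These scores are NOT comparable,
    # so we only use them to shortlist, never to rank across modalities.
    candidates = sorted(set(top_k(q, text_vecs, k)) | set(top_k(q, image_vecs, k)))

    # Stage 2: jina-reranker-m0 scores each (query, full document) pair, text and
    # image together, on a single unified scale.
    # NOTE: the multimodal document format below is an assumption; see the Reranker API docs.
    resp = requests.post(f"{API}/rerank", headers=HEADERS, json={
        "model": "jina-reranker-m0",
        "query": query,
        "documents": [corpus[i] for i in candidates],
        "top_n": top_n,
    })
    return [(int(candidates[r["index"]]), r["relevance_score"])
            for r in resp.json()["results"]]
```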

We expanded the experiments from Table 1, now using jina-clip-v2 to retrieve documents from the corpus, then jina-reranker-m0 to rerank them:
- Retrieve 32 documents via query-to-text, then rerank based on query-to-text score.
- Retrieve 32 documents via query-to-image, then rerank based on query-to-image score.
- Retrieve 16 documents via query-to-text and 16 via query-to-image. Rerank each document based on its query-to-text or query-to-image score, depending on which modality retrieved it.
- Retrieve 16 documents via query-to-text and 16 via query-to-image. Rerank based on each document’s averaged query-to-text and query-to-image scores, giving a final score of (query-to-text + query-to-image)/2.
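Experiment 4 only changes the final scoring step: each candidate gets two jina-reranker-m0 scores, one for its caption and one for its image, and the document is ranked by their mean. Because the reranker puts both modalities on one scale, this average is meaningful in a way that averaging the raw jina-clip-v2 similarities was not. A minimal sketch, assuming you've already collected the two scores per document:

```python
def fuse_reranker_scores(text_scores, image_scores):
    """Rank documents by the mean of their query-to-text and query-to-image reranker scores.

    text_scores / image_scores: dicts mapping doc ID -> jina-reranker-m0 relevance score
    for the (query, caption) and (query, image) pairs respectively.
    """
    fused = {
        doc: (text_scores[doc] + image_scores[doc]) / 2
        for doc in text_scores.keys() & image_scores.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```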
| Experiment | Description | Recall@10 (%) with jina-clip-v2 | Recall@10 (%) with jina-reranker-m0 |
|---|---|---|---|
| 1 | 32 docs: query-to-text | 14.55 | 17.42 |
| 2 | 32 docs: query-to-image | 22.38 | 28.94 |
| 3 | 16 docs: query-to-text + 16 docs: query-to-image | 14.55 | 33.81 |
| 4 | 16 docs: query-to-text + 16 docs: query-to-image, combined average reranker scores | 14.55 | 36.24 |
As we can see, by performing a second pass with jina-reranker-m0, recall goes up across the board, regardless of modality. However, we see the largest increase when we combine both textual and image content from the retrieved documents, hitting a recall@10 of 36.24. A visual example shows that jina-reranker-m0 consistently ranks the ground truth document first, whether searching text or image content:

Figure 7: The `top_k` of 1 result for each reranking methodology (four columns on the right), showing that combining image and text similarity scores consistently ranks the ground truth document first. While earlier figures use a `top_k` of 3 for the different retrieval methods, for reasons of space Figure 7 shows only the `top_k` of 1 result for each query.

## Conclusions
This simple two-stage approach delivers a 62% improvement in recall@10 (from 22.38 to 36.24 over the best single-modality retrieval) because the system finally leverages what humans do naturally: considering both what we read and what we see to determine relevance. The lesson extends beyond search: when dealing with multimodal AI systems, single-pass approaches that treat modalities separately will always hit this scoring-incompatibility wall. Two-stage architectures that retrieve broadly, then rank intelligently, are becoming essential. Try jina-reranker-m0 via our API or on AWS, GCP, and Azure.