Imagine you're building a sports news search system. A user searches for "tennis players celebrating championship victory" and you need to find the most relevant articles from your database. Each article contains both a text caption and an image - typical of modern sports coverage.
Your system needs to take a text query and return a ranked list of the most relevant multimodal documents from your corpus. Sounds straightforward, but there's a fundamental problem that breaks all obvious approaches.
Here's what happens when you try to rank these documents. Your embedding model, say jina-clip-v2, produces similarity scores like this:
| Article | Content Type | Description | Similarity Score |
|---|---|---|---|
| A | Text | Novak Djokovic wins Australian Open final in straight sets | 0.72 |
| A | Image | [photo of player holding trophy and smiling] | 0.31 |
| B | Text | Weather delays affect outdoor tournament scheduling | 0.23 |
| B | Image | [photo of tennis players jumping and celebrating] | 0.54 |
Which article is more relevant? Article A has a high text score but low image score. Article B has a low text score but higher image score. The fundamental challenge is that you cannot compare 0.72 (text) with 0.54 (image) because these similarity scores exist on completely different scales.
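For concreteness, here's roughly how scores like these come about: embed the query, the caption, and the image with the same model, then take the cosine similarity between the query vector and each document vector. The sketch below is a minimal illustration against the hosted Jina Embeddings API; the API key handling, the placeholder image URL, and the exact multimodal payload shape are assumptions, so check the current API documentation before relying on it.

```python
import os

import numpy as np
import requests

# Embed the query, the article caption, and the article image with jina-clip-v2.
# NOTE: the image URL is a placeholder and the payload shape is an assumption here.
resp = requests.post(
    "https://api.jina.ai/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-clip-v2",
        "input": [
            {"text": "tennis players celebrating championship victory"},    # the query
            {"text": "Novak Djokovic wins Australian Open final in straight sets"},
            {"image": "https://example.com/trophy-photo.jpg"},               # placeholder
        ],
    },
)
vecs = np.array([item["embedding"] for item in resp.json()["data"]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize for cosine similarity

query_to_text = float(vecs[0] @ vecs[1])   # query-to-text similarity
query_to_image = float(vecs[0] @ vecs[2])  # query-to-image similarity, on a different scale
print(query_to_text, query_to_image)
```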
## When Trivial Solutions Fail

Because of the modality gap in jina-clip-v2, and in almost every other CLIP-like model, none of the obvious approaches works. If you just use the higher score, you run into the fact that text scores cluster around 0.2-0.8 while image scores cluster around 0.4-0.6. This means a mediocre text match (0.6) will always beat an excellent image match (0.5).
Averaging the scores doesn't help either. Computing (0.7 + 0.3)/2 = 0.5 gives you a number, but what does it actually mean? You're averaging quantities that don't share a scale, so the result is meaningless. Similarly, any fixed weighting scheme is arbitrary: sometimes text matters more, sometimes images do, and this depends entirely on the specific query and document.
Even normalizing the scores first doesn't solve the core issue. You're still trying to combine fundamentally different similarity measures that capture different aspects of relevance.
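To make the failure concrete, here's a tiny self-contained sketch using the scores from the table above. Every obvious fusion rule ranks article A first, even though article B's image is arguably the better match for the query, simply because the text and image scores live on different scales:

```python
# Scores from the table above: query-to-text and query-to-image per article.
scores = {
    "A": {"text": 0.72, "image": 0.31},  # strong text match, weak image match
    "B": {"text": 0.23, "image": 0.54},  # weak text match, good image match
}

def rank(strategy):
    """Rank articles by a given fusion strategy (higher fused score is better)."""
    return sorted(scores, key=lambda article: strategy(scores[article]), reverse=True)

# 1. Take the higher score: text scores sit on a wider, higher scale, so text wins by default.
print(rank(lambda s: max(s["text"], s["image"])))            # ['A', 'B']

# 2. Average: collapses two incomparable scales into one number.
print(rank(lambda s: (s["text"] + s["image"]) / 2))          # ['A', 'B']

# 3. Fixed weights: any choice of w is arbitrary and query-dependent.
w = 0.7
print(rank(lambda s: w * s["text"] + (1 - w) * s["image"]))  # ['A', 'B']
```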
## What Actually Happens

To get a better idea of what we're working with, here's an example document from the EDIS dataset, showing the image (a German football match) and the caption ("One More Field Where the Content Trails Germany").

Overall, jina-clip-v2 shows much higher similarities when comparing query-to-text than query-to-image in the EDIS dataset, in part because of the way the model was trained and in part due to the dataset itself:

Therefore, it seems logical to retrieve a document based on its text rather than its image. And, as we can see in the graphic below, we get much better results comparing the text query "... for undocumented immigrants helping to establish legal status in the United States" to the text contents of the corpus. In fact, searching by image fails to retrieve the ground truth document (highlighted in yellow) at all:

Figure 3: Retrieval results for this query with a `top_k` of 3.

But don't be fooled. Despite query-to-text showing higher similarity scores, query-to-text and query-to-image similarity scores are not comparable. We can see this by looking at recall@10 when we use jina-clip-v2 to retrieve 32 documents from the EDIS dataset. Clearly, recall is higher with query-to-image:
| Retrieval method | Recall@10 (%) |
|---|---|
| Query-to-text | 14.55 |
| Query-to-image | 22.38 |

Table 1: Recall@10 for jina-clip-v2 retrieval on the EDIS dataset.
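For reference, recall@10 here is the percentage of queries whose ground-truth document shows up among the top ten results. A minimal sketch of that computation, assuming one relevant document per query (not the exact EDIS evaluation code):

```python
def recall_at_k(retrieved, ground_truth, k=10):
    """Percentage of queries whose ground-truth doc ID appears in the top-k results.

    retrieved:    dict mapping query ID -> list of doc IDs, best match first.
    ground_truth: dict mapping query ID -> the single relevant doc ID.
    """
    hits = sum(1 for qid, gt in ground_truth.items() if gt in retrieved.get(qid, [])[:k])
    return 100.0 * hits / len(ground_truth)
```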
We can see this below: if we use a query from the dataset, "Ear ear An elephant is decorated with Bhartiya Janta Party symbols near the BJP headquarters in New Delhi.", we can retrieve the ground truth document only by its image content. Searching by its text content doesn't return any matches:

Figure 4: Retrieval results for this query with a `top_k` of 3.

So, if similarity scores imply we should retrieve documents from their text, and recall implies we should retrieve them from their images, which should we choose? Certainly, Figures 3 and 4 suggest no outright winner. Which modality really presents the closest match between our query and the document we're looking for? And if we want to merge candidates from both query-to-text and query-to-image retrieval, how can we meaningfully select the top matches if we can't even compare scores? Clearly, just using jina-clip-v2 won't cut it. We need to throw another model into the mix.
## A Simple Two-Stage Pipeline
In April 2025 we released jina-reranker-m0, a multilingual multimodal reranker for retrieving visual documents. We can see its narrower modality gap below, where jina-reranker-m0 shows comparable query-to-text and query-to-image similarity scores, contrasted with the much wider gap shown by jina-clip-v2:

With this in mind, we can use jina-reranker-m0 for a second pass in the retrieval chain, after initial results are retrieved from jina-clip-v2:
Stage 1: Retrieve candidates from both modalities
- Use jina-clip-v2 to get 16 documents via text search + 16 via image search
- Accept that we can't compare scores yet
Stage 2: Unified reranking
- Feed each (query + full document) pair into jina-reranker-m0
- The reranker processes both text AND image together
- Output: Single relevance score on a unified scale
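Here's a minimal sketch of that pipeline against the hosted Jina Embeddings and Reranker APIs. The corpus format, the per-query embedding of the whole corpus (in practice you'd precompute document embeddings or use a vector index), and the exact multimodal request payloads are assumptions; consult the current API documentation for the real formats.

```python
import os

import numpy as np
import requests

API = "https://api.jina.ai/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"}

def embed(inputs):
    """Embed a list of {'text': ...} or {'image': <url>} items with jina-clip-v2."""
    resp = requests.post(f"{API}/embeddings", headers=HEADERS,
                         json={"model": "jina-clip-v2", "input": inputs})
    vecs = np.array([d["embedding"] for d in resp.json()["data"]])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def top_k(query_vec, doc_vecs, k=16):
    """Indices of the k documents most cosine-similar to the query."""
    return np.argsort(-(doc_vecs @ query_vec))[:k]

def search(query, corpus, k=16, top_n=10):
    """corpus: list of {'text': caption, 'image': image_url} documents (hypothetical format)."""
    q = embed([{"text": query}])[0]
    text_vecs = embed([{"text": doc["text"]} for doc in corpus])
    image_vecs = embed([{"image": doc["image"]} for doc in corpus])

    # Stage 1: candidates from both modalities. These scores are NOT comparable,
    # so we only use them to shortlist, never to rank across modalities.
    candidates = sorted(set(top_k(q, text_vecs, k)) | set(top_k(q, image_vecs, k)))

    # Stage 2: jina-reranker-m0 scores each (query, full document) pair, text and
    # image together, on a single unified scale.
    # NOTE: the multimodal document format below is an assumption; see the Reranker API docs.
    resp = requests.post(f"{API}/rerank", headers=HEADERS, json={
        "model": "jina-reranker-m0",
        "query": query,
        "documents": [corpus[i] for i in candidates],
        "top_n": top_n,
    })
    return [(int(candidates[r["index"]]), r["relevance_score"])
            for r in resp.json()["results"]]
```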

We expanded the experiments from Table 1, now using jina-clip-v2 to retrieve documents from the corpus, then jina-reranker-m0 to rerank them:
- Retrieve 32 documents via query-to-text, then rerank based on query-to-text score.
- Retrieve 32 documents via query-to-image, then rerank based on query-to-image score.
- Retrieve 16 documents via query-to-text and 16 via query-to-image. Rerank each document based on its query-to-text or query-to-image score, depending on which modality retrieved it.
- Retrieve 16 documents via query-to-text and 16 via query-to-image. Rerank based on each document’s averaged query-to-text and query-to-image scores, giving a final score of (query-to-text + query-to-image)/2.
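Experiment 4 only changes the final scoring step: each candidate gets two jina-reranker-m0 scores, one for its caption and one for its image, and the document is ranked by their mean. Because the reranker puts both modalities on one scale, this average is meaningful in a way that averaging the raw jina-clip-v2 similarities was not. A minimal sketch, assuming you've already collected the two scores per document:

```python
def fuse_reranker_scores(text_scores, image_scores):
    """Rank documents by the mean of their query-to-text and query-to-image reranker scores.

    text_scores / image_scores: dicts mapping doc ID -> jina-reranker-m0 relevance score
    for the (query, caption) and (query, image) pairs respectively.
    """
    fused = {
        doc: (text_scores[doc] + image_scores[doc]) / 2
        for doc in text_scores.keys() & image_scores.keys()
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)
```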
| Experiment | Description | Recall@10 (%) with jina-clip-v2 | Recall@10 (%) with jina-reranker-m0 |
|---|---|---|---|
| 1 | 32 docs: query-to-text | 14.55 | 17.42 |
| 2 | 32 docs: query-to-image | 22.38 | 28.94 |
| 3 | 16 docs: query-to-text + 16 docs: query-to-image | 14.55 | 33.81 |
| 4 | 16 docs: query-to-text + 16 docs: query-to-image, combined average reranker scores | 14.55 | 36.24 |
As we can see, by performing a second pass with jina-reranker-m0, recall goes up across the board, regardless of modality. However, we see the largest increase when we combine both textual and image content from the retrieved documents, hitting a recall@10 of 36.24. A visual example shows that jina-reranker-m0 consistently ranks the ground truth document first, whether searching text or image content:

Figure 7: The `top_k` of 1 result for each reranking methodology (four columns on the right), showing that combining image and text similarity scores consistently ranks the ground truth document first. While earlier figures use a `top_k` of 3 for the different retrieval methods, for reasons of space Figure 7 shows only the `top_k` of 1 result for each query.

## Conclusions
This simple two-stage approach delivers a 62% improvement in recall@10 (from 22.38 to 36.24 over the best single-modality retrieval) because the system finally leverages what humans do naturally: considering both what we read and what we see to determine relevance. The lesson extends beyond search: when dealing with multimodal AI systems, single-pass approaches that treat modalities separately will always hit this scoring-incompatibility wall. Two-stage architectures that retrieve broadly, then rank intelligently, are becoming essential. Try jina-reranker-m0 via our API or on AWS, GCP, and Azure.