

Today we're releasing jina-reranker-m0, our new multilingual multimodal reranker model for ranking visual documents across multiple languages. It accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and various layouts across multiple domains and over 29 languages, and outputs a list of documents ranked by their relevance to the query. Compared to jina-reranker-v2-base-multilingual, jina-reranker-m0 also improves text reranking for multilingual content, long documents, and code search tasks.


tagNew Architecture
Unlike jina-reranker-v2-base-multilingual, jina-reranker-m0 shifts from the classic cross-encoder architecture to a decoder-only vision language model. It leverages the vision encoder and projector of the pretrained Qwen2-VL, fine-tunes its LLM with LoRA, and post-trains an MLP head to generate ranking logits that measure query-document relevance. The result is a discriminative model optimized for ranking tasks; a conceptual sketch of the scoring head follows the comparison table below.
 | jina-reranker-m0 | jina-reranker-v2 |
---|---|---|
Architecture | Vision Language Model | Cross-Encoder |
Base model | Qwen2-VL-2B | Jina-XLM-RoBERTa |
Parameters | 2.4 B | 278 M |
Max context length (query + document) | 10,240 | 8,192 |
Max image patches (dynamic resolution) | 768 × 28 × 28 | ❌ |
Multilingual support | ✅ | ✅ |
Tasks supported | Text2Text, Text2Image, Image2Text, Text2Mixed | Text2Text |
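To make the scoring mechanism concrete, here is a conceptual sketch of the ranking head. It is not the released implementation: we assume, purely for illustration, that the decoder reads the concatenated (query, document) sequence and that a small MLP maps the last token's hidden state to a single relevance logit.

```python
# Conceptual sketch only -- hypothetical names, not the released jina-reranker-m0 code.
# Assumption: the decoder processes the (query, document) pair and an MLP head maps the
# last token's hidden state to one relevance logit.
import torch
import torch.nn as nn


class RankingHead(nn.Module):
    """Maps the decoder's final hidden state to one relevance logit per pair."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # last_hidden_state: (batch, seq_len, hidden_size)
        pooled = last_hidden_state[:, -1, :]   # last-token pooling (our assumption)
        return self.mlp(pooled).squeeze(-1)    # (batch,) raw logits, higher = more relevant


# Usage: feed hidden states from a Qwen2-VL-style decoder for a batch of pairs.
head = RankingHead(hidden_size=1536)
dummy_hidden = torch.randn(4, 128, 1536)       # 4 dummy (query, document) pairs
print(head(dummy_hidden).shape)                # torch.Size([4])
```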
This new architecture allows jina-reranker-m0 to handle up to 32K tokens, seamlessly combining both visual and textual inputs. The model supports images ranging from a minimum size of 56×56 pixels up to 4K resolution. When processing images, the ViT and projector condense adjacent 2×2 tokens into single visual tokens for LLM input. Special tokens such as `<|vision_start|>` and `<|vision_end|>` clearly mark visual token boundaries, enabling the language model to properly process visual information and perform sophisticated multimodal reasoning that integrates both visual and textual elements.
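As a rough illustration of the image token budget (a back-of-the-envelope approximation on our part, not the exact preprocessing logic), each visual token after 2×2 merging covers roughly a 28×28 pixel area, and an image is rescaled whenever its token count would exceed the per-image budget:

```python
# Back-of-the-envelope sketch (our approximation, not the actual preprocessing code):
# after 2x2 merging, one visual token covers roughly a 28x28 pixel area, and the
# dynamic-resolution logic keeps the per-image token count within budget.
import math

MAX_VISUAL_TOKENS = 768  # per image, between <|vision_start|> and <|vision_end|>


def estimate_visual_tokens(width: int, height: int) -> int:
    """Rough visual-token count for an image before any rescaling."""
    return math.ceil(width / 28) * math.ceil(height / 28)


def needs_downscaling(width: int, height: int) -> bool:
    return estimate_visual_tokens(width, height) > MAX_VISUAL_TOKENS


# A 1024x1024 page would need 37 * 37 = 1369 tokens, so it gets rescaled to fit 768;
# a 672x672 thumbnail needs 24 * 24 = 576 tokens and passes through unchanged.
print(estimate_visual_tokens(1024, 1024), needs_downscaling(1024, 1024))  # 1369 True
print(estimate_visual_tokens(672, 672), needs_downscaling(672, 672))      # 576 False
```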
This architecture also effectively solves the modality gap problem that plagued earlier models like jina-clip-v1 and jina-clip-v2. Previously, images would cluster near other images while text would cluster near other text in the representation space, creating a disconnect. This meant that when your candidate documents contained both images and text, retrieving images using text queries was problematic. With jina-reranker-m0, you can now rank images and documents together without worrying about this gap, creating a truly unified multimodal search experience.
It's worth noting that our training was limited to a maximum of 10K input tokens, with up to 768 tokens per image (between the `<|vision_start|>` and `<|vision_end|>` markers). Additionally, we didn't specifically train the model for image-to-image, image-to-multimodal, or text-to-multimodal reranking tasks. In this context, "multimodal" refers to a single document containing both image and text tokens in the input. Looking at all possible combinations of image and text tokens in queries and documents, we can summarize the full range of tasks supported by jina-reranker-m0 in the table below.
In our testing, we found some evidence suggesting the model can extrapolate to these untrained ranking tasks, but any effectiveness in these areas should be viewed as a result of the model's zero-shot transferability or unintended training side effects. We haven't conducted serious evaluations of the model's performance on these tasks, and plan to explore these capabilities more thoroughly in future research.
tagGetting Started
For a quick vibe check, try our text-to-image rerank demo in the Search Foundation toolbox. We've prepared a collection of document images from our website, and you can also add your own image URLs. Simply type your query and press Enter to see the ranked results. You can treat it either as OCR-style search or as content-based image retrieval, and feel free to try queries in languages other than English.
The demo is available at https://jina.ai/api-dashboard/m0-image-rerank. Please note that using it will consume tokens from your primary API key. The demo may also feel a bit slow, since the server has to download every image from its URL and no image cache is implemented.
tagVia API
The code below shows how to calculate relevance scores between the query "small language model data extraction" and a collection of image and text documents. Each document can be a text string, a base64-encoded image, or an image URL. New users can get a Jina API key with one million free tokens. While our API doesn't support using images as queries, you can do so when accessing the model through the Hugging Face Transformers library.
curl -X POST \
  https://api.jina.ai/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer JINA_API_KEY" \
  -d '{
    "model": "jina-reranker-m0",
    "query": "small language model data extraction",
    "documents": [
      {
        "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
      },
      {
        "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
      },
      {
        "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png"
      },
      {
        "text": "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements."
      },
      {
        "image": "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
      },
      {
        "text": "数据提取么?为什么不用正则啊,你用正则不就全解决了么?"
      },
      {
        "text": "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold."
      },
      {
        "text": "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar."
      }
    ],
    "return_documents": false
  }'
The response is shown below, where the top result, `index=1`, corresponds to our ReaderLM-v2 paper screenshot.
{"model":"jina-reranker-m0","usage":{"total_tokens":2829},"results":[{"index":1,"relevance_score":0.9587112551898949},{"index":3,"relevance_score":0.9337408271911014},{"index":7,"relevance_score":0.8922925217195924},{"index":2,"relevance_score":0.8891905997562045},{"index":0,"relevance_score":0.8827516945848907},{"index":4,"relevance_score":0.8701035914834407},{"index":6,"relevance_score":0.8676828987527296},{"index":5,"relevance_score":0.8455347349164652}]}
tagVia CSP Marketplaces
jina-reranker-m0 will soon be available directly on AWS, Azure, and GCP at the prices listed there.

tagVia HuggingFace
You can also use the model locally from our Hugging Face page. We've prepared a Google Colab notebook that demonstrates how it works. Compared to our web API, using the model locally offers greater flexibility, such as the ability to use images as queries and work with multimodal documents.
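As a minimal sketch of local usage: the snippet below assumes the model exposes a compute_score-style helper via trust_remote_code, as our earlier reranker models do; the exact parameter names (for example, doc_type) are an assumption here, so check the model card on Hugging Face for the authoritative example.

```python
# Minimal local-usage sketch. Assumes a compute_score helper exposed via
# trust_remote_code, like our earlier rerankers; the exact arguments (e.g. doc_type)
# may differ -- see the model card on Hugging Face for the authoritative example.
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-reranker-m0",
    torch_dtype="auto",
    trust_remote_code=True,
)
model.eval()

query = "small language model data extraction"
image_documents = [
    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png",
]

# Score (query, image) pairs; doc_type="image" is an assumed flag, not a confirmed API.
pairs = [[query, doc] for doc in image_documents]
scores = model.compute_score(pairs, doc_type="image")

for doc, score in sorted(zip(image_documents, scores), key=lambda x: -x[1]):
    print(f"{score:.4f}  {doc}")
```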

tagConclusion
jina-reranker-m0 is our first attempt to unify textual and visual modalities in a single decoder-only model. This new architecture incorporates lessons learned from our previous encoder-only retrieval models, including jina-clip-v2, jina-embeddings-v3, jina-reranker-v2-base-multilingual and jina-embeddings-v2-base-code.
The new model not only unlocks capabilities for multimodal retrieval tasks, such as text-to-image reranking and visual document reranking, but also demonstrates improved performance compared to jina-reranker-v2-base-multilingual on text-to-text and text-to-code reranking tasks. We designate this new model series as the "m-series" to highlight its multimodal nature.
When comparing jina-reranker-m0 with jina-reranker-v2-base-multilingual, our goal for the m-series is to achieve multimodality while keeping text-only performance on par with specialized text-only models. Some might question the value of an 8x larger model if the improvement on text-only tasks appears marginal. While it's true that, for the moment, m0 may not provide substantial added value over v2 for text-only applications, the decoder-only architecture opens up many new possibilities that weren't achievable with encoder-only architectures, including:
- True mixed-modality reranking
- Listwise reranking and document deduplication
- Ranking score explainability via attention mechanisms
Our future work will focus on further upgrading the text-only reranker and fully leveraging the new features enabled by this multimodal architecture to deliver better, broader search.
tagAppendix: Evaluation
Full evaluations can be found in this Google Spreadsheet.
tagBEIR (Text2Text, English-only)

BEIR is a heterogeneous benchmark for information retrieval, designed to evaluate the versatility and robustness of IR models. It contains a diverse set of datasets from various domains and focuses on zero-shot evaluation. Standardized evaluation metrics such as NDCG, Recall@K, and MRR are used.
Model | AVG (NDCG@10) | TREC-COVID | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche-2020 | DBPedia | SCIDOCS | FEVER | Climate-FEVER | SciFact | Quora |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
jina-reranker-m0 | 58.95 | 84.17 | 41.03 | 72.25 | 76.99 | 51.62 | 40.69 | 31.79 | 49.34 | 22.91 | 91.14 | 36.42 | 79.94 | 88.01 |
jina-embeddings-v3 (1024 tokens) | 55.81 | 77.81 | 36.65 | 64.31 | 64.63 | 47.47 | 54.31 | 26.55 | 41.07 | 19.91 | 89.00 | 42.33 | 72.4 | 89.06 |
bge-reranker-v2-m3 | 56.51 | 82.19 | 34.33 | 69.52 | 77.89 | 45.45 | 36.21 | 33.12 | 46.72 | 17.79 | 91.03 | 38.69 | 72.64 | 89.10 |
jina-reranker-v2-multilingual | 57.06 | 80.53 | 37.17 | 67.39 | 76.17 | 46.48 | 39.28 | 32.35 | 47.81 | 20.03 | 93.02 | 37.17 | 76.50 | 87.83 |
tagMIRACL (Text2Text, Multilingual, 18 languages)

MIRACL is a large-scale multilingual dataset for ad hoc information retrieval across 18 languages. These languages collectively have over three billion native speakers, and the dataset features thorough human annotations. The focus is on monolingual retrieval tasks.
Model | AVG (NDCG@10) | ar | bn | en | es | fa | fi | fr | hi | id | ja | ko | ru | sw | te | th | zh | de | yo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
jina-reranker-m0 | 66.75 | 79.78 | 78.01 | 59.21 | 53.56 | 58.80 | 78.00 | 56.66 | 62.83 | 54.92 | 66.51 | 72.86 | 67.26 | 59.04 | 70.19 | 80.37 | 64.51 | 58.50 | 80.44 |
jina-embeddings-v3 (8192 tokens) | 58.90 | 71.53 | 69.86 | 48.37 | 46.91 | 54.13 | 71.15 | 50.90 | 55.05 | 47.83 | 56.46 | 64.76 | 55.63 | 54.07 | 70.48 | 73.56 | 55.29 | 49.18 | 65.01 |
bge-reranker-v2-m3 | 69.32 | 80.51 | 81.85 | 57.67 | 57.64 | 61.92 | 80.38 | 59.60 | 67.66 | 58.86 | 67.37 | 75.14 | 67.61 | 68.92 | 76.69 | 82.29 | 64.46 | 58.32 | 80.85 |
jina-reranker-v2-multilingual | 63.65 | 72.50 | 79.42 | 46.66 | 51.54 | 57.81 | 73.05 | 50.90 | 60.94 | 56.66 | 59.15 | 72.60 | 53.43 | 66.47 | 74.62 | 77.75 | 62.49 | 53.06 | 76.69 |
tagMLDR (Text2Text, Multilingual Long Documents, 13 languages)

MLDR is a multilingual dataset specifically designed for long-document retrieval, covering 13 languages. It utilizes GPT-3.5 to generate questions for the documents. The dataset is built on Wikipedia, Wudao and mC4.
Model | AVG (NDCG@10) | ar | de | en | es | fr | hi | it | ja | ko | pt | ru | th | zh |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
jina-reranker-m0 | 59.83 | 55.86 | 51.25 | 54.67 | 87.63 | 82.59 | 32.76 | 73.25 | 58.93 | 55.73 | 86.08 | 66.73 | 39.17 | 33.14 |
jina-embeddings-v3 (8192 tokens) | 39.71 | 28.44 | 31.57 | 29.07 | 62.08 | 59.79 | 25.47 | 53.72 | 38.36 | 32.37 | 63.26 | 49.65 | 25.15 | 17.26 |
bge-reranker-v2-m3 | 53.53 | 49.19 | 45.39 | 43.92 | 74.57 | 68.67 | 44.75 | 62.79 | 49.27 | 48.24 | 76.45 | 62.84 | 38.82 | 31.02 |
jina-reranker-v2-multilingual | 59.50 | 51.96 | 50.13 | 46.85 | 86.34 | 82.25 | 49.50 | 69.00 | 59.07 | 52.19 | 85.26 | 68.06 | 38.73 | 34.15 |
tagMKQA (Text2Text, Multilingual Question-Answering, 24 languages, 3 variants for Chinese)

MKQA is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages. The question-answer pairs are sampled from Google Natural Questions.
Model | AVG (recall@10) | ar | da | de | es | en | fi | fr | he | hu | it | ja | km | ko | ms | nl | no | pl | pt | ru | sv | th | tr | vi | zh_cn | zh_hk | zh_tw |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
jina-reranker-m0 | 68.19 | 63.88 | 70.57 | 70.52 | 71.26 | 73.47 | 64.10 | 71.11 | 63.68 | 63.23 | 70.30 | 69.13 | 50.43 | 64.30 | 70.78 | 71.73 | 70.25 | 69.72 | 70.57 | 70.78 | 70.69 | 69.80 | 67.90 | 69.68 | 69.12 | 68.23 | 67.79 |
jina-embeddings-v3 (8192 tokens) | 65.63 | 59.00 | 69.12 | 68.27 | 68.15 | 71.14 | 65.66 | 68.30 | 59.51 | 63.23 | 68.30 | 64.36 | 56.13 | 58.98 | 68.30 | 69.53 | 68.65 | 67.26 | 67.93 | 67.06 | 68.68 | 66.32 | 66.97 | 66.87 | 63.38 | 63.59 | 61.55 |
bge-reranker-v2-m3 | 67.88 | 63.09 | 70.15 | 68.91 | 68.92 | 73.00 | 68.71 | 68.71 | 70.27 | 64.00 | 68.15 | 68.47 | 60.43 | 63.95 | 68.80 | 70.77 | 69.10 | 67.44 | 67.40 | 69.77 | 70.03 | 69.68 | 66.04 | 68.29 | 67.84 | 66.70 | 66.34 |
jina-reranker-v2-multilingual | 67.90 | 63.88 | 70.31 | 70.09 | 70.51 | 73.09 | 67.50 | 70.38 | 63.00 | 64.59 | 69.90 | 67.34 | 57.79 | 62.14 | 70.36 | 71.58 | 69.51 | 68.61 | 70.13 | 70.07 | 70.15 | 68.80 | 68.02 | 69.39 | 67.23 | 65.77 | 65.37 |
tagCoIR (Text2Text, Code Information Retrieval)

CoIR is a comprehensive benchmark designed to evaluate models’ abilities in code retrieval. It includes 10 curated code datasets covering 8 retrieval tasks across 7 diverse domains. A Python framework is provided for this benchmark.
Model Name | Avg (NDCG@10) | Apps | CosQA | SQL | CSN AVG | CSN python | CSN javascript | CSN go | CSN ruby | CSN java | CSN php | CSN-CCR AVG | CSN-CCR python | CSN-CCR javascript | CSN-CCR go | CSN-CCR ruby | CSN-CCR java | CSN-CCR php | CodeTransOcean-Contest | CodeTransOcean-DL | StackOverflow QA | CodeFeedBack-MT | CodeFeedBack-ST |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
jina-reranker-m0 | 63.55 | 26.21 | 37.75 | 57.92 | 80.76 | 98.37 | 71.16 | 86.14 | 72.74 | 79.02 | 77.14 | 74.57 | 81.66 | 77.92 | 68.71 | 75.44 | 77.54 | 66.13 | 79.79 | 31.89 | 90.41 | 72.25 | 83.95 |
jina-embeddings-v2-base-code (top 100) | 56.90 | 16.34 | 41.72 | 49.79 | 83.95 | 94.71 | 76.35 | 87.39 | 78.23 | 82.69 | 84.35 | 59.65 | 68.23 | 62.31 | 49.15 | 65.40 | 63.89 | 48.92 | 79.20 | 30.35 | 89.42 | 49.62 | 68.93 |
bge-reranker-v2-m3 | 35.97 | 8.33 | 30.06 | 50.63 | 49.26 | 67.62 | 39.55 | 58.11 | 41.37 | 44.77 | 44.13 | 40.81 | 42.57 | 42.75 | 38.04 | 38.04 | 41.73 | 41.73 | 34.93 | 5.09 | 60.12 | 16.44 | 64.05 |
jina-reranker-v2-multilingual | 56.14 | 21.90 | 37.26 | 53.56 | 78.88 | 97.83 | 67.43 | 84.64 | 68.93 | 75.73 | 78.71 | 63.59 | 72.62 | 67.80 | 55.07 | 67.25 | 64.25 | 54.54 | 73.67 | 25.74 | 91.24 | 42.03 | 73.59 |
tagViDoRe (Text2Image, Visual Document Retrieval Benchmark)

ViDoRe is a benchmark designed to evaluate retrieval systems on their capacity to match queries to relevant documents using visual features. It covers various page-level retrieval tasks across multiple domains and languages. The benchmark focuses on the visual elements of documents.
Model Name | AVG (NDCG@5) | TAT-DQA | Shift Project | Artificial Intelligence | Government Reports | ArxivQA | DocVQA | Healthcare Industry | InfoVQA | Energy | TabFQuad |
---|---|---|---|---|---|---|---|---|---|---|---|
jina-reranker-m0 | 91.02 | 81.83 | 93.22 | 99.63 | 97.59 | 89.82 | 62.58 | 99.26 | 92.88 | 96.06 | 97.32 |
MrLight/dse-qwen2-2b-mrl-v1 | 84.48 | 66.64 | 79.39 | 96.45 | 95.30 | 84.53 | 55.47 | 96.85 | 86.39 | 91.80 | 92.03 |
MonoQwen2-VL-v0.1 | 87.64 | 79.50 | 76.38 | 98.39 | 93.63 | 89.50 | 57.47 | 98.39 | 92.12 | 95.29 | 95.75 |
tagM-BEIR (Text2Image, Image2Text, Multimodal BEnchmark for Instructed Retrieval)

M-BEIR is a comprehensive large-scale retrieval benchmark designed to train and evaluate multimodal retrieval models. It comprises eight multimodal retrieval tasks and ten datasets from a variety of domains and sources. The benchmark focuses on instruction-following retrieval.
Model | MBEIR t2i VisualNews Recall@5 | MBEIR t2i MSCOCO Recall@5 | MBEIR t2i Fashion200K Recall@10 | MBEIR i2t VisualNews Recall@5 | MBEIR i2t MSCOCO Recall@5 | MBEIR i2t Fashion200K Recall@10 |
---|---|---|---|---|---|---|
jina-reranker-m0 | 23.89 | 72.19 | 9.79 | 17.61 | 41.21 | 11.56 |
jinaai/jina-clip-v2 | 15.42 | 52.28 | 7.03 | 11.63 | 28.80 | 8.78 |
MonoQwen2-VL-v0.1 | 22.74 | 71.29 | 10.00 | 15.08 | 42.24 | 11.25 |
tagWinoground (Text2Text, Text2Image)

Winoground is a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. It uses twin captions with identical word content and employs contrastive image-caption pairs. The focus is on compositional reasoning.
Model | Text | Image | Group | Avg |
---|---|---|---|---|
jina-reranker-m0 | 57.00 | 40.75 | 34.00 | 43.92 |
MrLight/dse-qwen2-2b-mrl-v1 | 7.50 | 9.25 | 1.75 | 6.17 |
MonoQwen2-VL-v0.1 | 52.00 | 36.25 | 31.50 | 39.92 |
Winoground evaluates vision-language models using three key metrics: Text Score, Image Score, and Group Score. The Text Score measures if a model correctly matches captions to images, while the Image Score assesses if it selects the right image for a caption. The Group Score, the most rigorous metric, requires all caption-image relationships to be correctly identified. Scores are percentages representing accuracy rates, with higher values indicating better reasoning abilities.
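To make these definitions concrete, here is a short sketch of how the three metrics are derived from pairwise relevance scores, following the scoring rules in the Winoground paper (the example scores are illustrative only):

```python
# Each Winoground example has two captions (c0, c1) and two images (i0, i1).
# s["cX_iY"] is the reranker's relevance score for caption X against image Y.

def text_correct(s: dict) -> bool:
    # For each image, the matching caption must outscore the mismatched one.
    return s["c0_i0"] > s["c1_i0"] and s["c1_i1"] > s["c0_i1"]

def image_correct(s: dict) -> bool:
    # For each caption, the matching image must outscore the mismatched one.
    return s["c0_i0"] > s["c0_i1"] and s["c1_i1"] > s["c1_i0"]

def group_correct(s: dict) -> bool:
    # The strictest metric: all four caption-image relationships must be resolved correctly.
    return text_correct(s) and image_correct(s)

# Illustrative scores for one example; the reported Text/Image/Group numbers are the
# percentage of benchmark examples for which each predicate holds.
example = {"c0_i0": 0.91, "c0_i1": 0.34, "c1_i0": 0.42, "c1_i1": 0.88}
print(text_correct(example), image_correct(example), group_correct(example))  # True True True
```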