jina-reranker-m0

Multilingual multimodal reranker model for ranking visual documents
Release Post
License: CC-BY-NC-4.0
Release Date: 2025-04-08
Input: Text (Query), Image (Query), Text (Document), Image (Document)
Output: Rankings
Model Details
Parameters: 2.4B
Input Token Length: 10K
Language Support
🌍 Multilingual support
Related Models
jina-reranker-v2-base-multilingual
Tags
multimodal
multilingual
code-search
long-context
reranker
vlm
decoder-only
Available via
Jina API, Commercial License, AWS SageMaker, Microsoft Azure, Google Cloud, Hugging Face

Overview

jina-reranker-m0 is a groundbreaking multimodal multilingual reranker model designed to rank visual documents across multiple languages. What makes this model exceptional is its ability to process queries alongside visually rich document images—including pages with text, figures, tables, and various layouts—across 29 languages. The model outputs a ranked list of documents ordered by their relevance to the input query. Unlike previous rerankers that struggled with the "modality gap" problem (where images clustered near other images while text clustered near text), jina-reranker-m0 unifies textual and visual modalities in a single decoder-only model, creating a seamless multimodal search experience that can rank both images and text documents together effectively.

Methods

The architecture of jina-reranker-m0 represents a significant departure from previous approaches. Built upon Qwen2-VL-2B with 2.1 billion parameters, it shifts from a classic cross-encoder architecture to a decoder-only vision language model. The system leverages Qwen2-VL's pretrained vision encoder and projector, fine-tunes its large language model with LoRA (Low-Rank Adaptation), and employs a post-trained MLP to generate ranking logits that measure query-document relevance. This discriminative model can handle up to 32K tokens and supports images from 56×56 pixels up to 4K resolution. When processing images, the Vision Transformer (ViT) and projector condense adjacent 2×2 tokens into single visual tokens, while special tokens clearly mark visual token boundaries, enabling the language model to properly integrate and reason across both visual and textual elements.
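To make this design concrete, the sketch below shows the general pattern in PyTorch: a decoder-only VLM backbone whose final-token hidden state is passed through an MLP head to produce a single relevance logit per query-document pair. It is a hypothetical illustration, not the released jina-reranker-m0 implementation; the class name, hidden size, and last-token pooling choice are assumptions.

import torch
import torch.nn as nn

class DecoderOnlyReranker(nn.Module):
    # Hypothetical sketch of a decoder-only reranker: VLM backbone + MLP ranking head.
    # Not the actual jina-reranker-m0 code; names and sizes are illustrative.
    def __init__(self, backbone: nn.Module, hidden_size: int = 1536):
        super().__init__()
        self.backbone = backbone  # pretrained decoder-only VLM (e.g. a Qwen2-VL-style model)
        self.ranking_head = nn.Sequential(  # post-trained MLP producing one logit
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, inputs_embeds: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # inputs_embeds interleaves visual tokens (from the ViT + projector) with the
        # text tokens of the query-document pair; the backbone attends across both.
        hidden = self.backbone(inputs_embeds=inputs_embeds,
                               attention_mask=attention_mask).last_hidden_state
        last_token = hidden[:, -1, :]  # hidden state of the final token
        return self.ranking_head(last_token).squeeze(-1)  # relevance logit per pair

In a pointwise setup like this, each candidate document is scored against the query independently and the resulting logits are sorted to produce the final ranking.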

Performance

jina-reranker-m0 achieves impressive results across multiple benchmarks. In text-to-text reranking, it scores 58.95 NDCG@10 on the BEIR benchmark, outperforming competitors like jina-embeddings-v3 (55.81) and bge-reranker-v2-m3 (56.51). For multilingual content, it achieves 66.75 NDCG@10 on the MIRACL benchmark covering 18 languages. On the MLDR benchmark for long documents, it scores 59.83 NDCG@10 across 13 languages. For code retrieval on the CoIR benchmark, it achieves 63.55 NDCG@10, significantly outperforming competitors. But the model truly shines in visual document retrieval: on the ViDoRe benchmark it scores an impressive 91.02 NDCG@5, and on Winoground, which tests visio-linguistic compositional reasoning, it achieves an average score of 43.92, demonstrating its superior ability to understand relationships between text and images compared to other models.

Best Practice

To maximize the potential of jina-reranker-m0, developers should consider several implementation strategies. The model is accessible via the API, cloud service marketplaces (AWS, Azure, GCP), or locally through Hugging Face. When using the API, developers can pass text strings, base64-encoded images, or image URLs, with new users eligible for ten million free tokens. While the model performs exceptionally well on text-to-text, text-to-image, image-to-text, and text-to-mixed-unimodal tasks thanks to extensive training, it's worth noting that some combinations (like image-to-image) are supported in a zero-shot manner without specific training. For optimal results, remember that the model supports up to 10K input tokens with up to 768 tokens per image. The architecture's decoder-only approach opens possibilities beyond simple reranking, including true mixed-modality reranking, listwise reranking, document deduplication, and ranking score explainability via attention mechanisms, capabilities that weren't achievable with previous encoder-only architectures.
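As a starting point, here is a hedged sketch of a rerank request over mixed text and image documents against the public Jina API endpoint. The query, document contents, and image URL are made up for illustration, and the exact request and response fields should be verified against the current API documentation before use.

import os
import requests

# Illustrative request to the Jina rerank endpoint with one text and one image document.
# Field names follow the documented rerank API at the time of the model's release;
# check the current API docs. The query and documents below are hypothetical.
response = requests.post(
    "https://api.jina.ai/v1/rerank",
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ['JINA_API_KEY']}",
    },
    json={
        "model": "jina-reranker-m0",
        "query": "What was the total revenue in Q4 2024?",
        "documents": [
            {"text": "Q4 2024 revenue grew 12% year over year to $3.1B."},
            {"image": "https://example.com/annual-report-page-17.png"},
        ],
        "top_n": 2,
    },
    timeout=30,
)
response.raise_for_status()
for result in response.json()["results"]:
    print(result["index"], result["relevance_score"])

Each returned result carries the index of the original document and its relevance score, so the caller can reorder text and image documents together on a single scale.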
Blogs that mention this model
April 08, 2025 • 21 minutes read
jina-reranker-m0: Multilingual Multimodal Document Reranker
Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
Jina AI
May 25, 2025 • 8 minutes read
Fair Scoring for Multimodal Documents with jina-reranker-m0
Text similarity: 0.7. Image similarity: 0.5. Which document is more relevant? You literally cannot tell—and that's the core problem breaking multimodal search. We solve it with unified reranking.
Nan Wang
Alex C-G
April 16, 2025 • 10 minutes read
On the Size Bias of Text Embeddings and Its Impact in Search
Size bias refers to how the length of text inputs affects similarity, regardless of semantic relevance. It explains why search systems sometimes return long, barely-relevant documents instead of shorter, more precise matches to your query.
Scott Martens