Overview
jina-reranker-m0 is a groundbreaking multimodal multilingual reranker model designed to rank visual documents across multiple languages. What makes this model exceptional is its ability to process queries alongside visually rich document images—including pages with text, figures, tables, and various layouts—across 29 languages. The model outputs a ranked list of documents ordered by their relevance to the input query. Unlike previous rerankers that struggled with the "modality gap" problem (where images clustered near other images while text clustered near text), jina-reranker-m0 unifies textual and visual modalities in a single decoder-only model, creating a seamless multimodal search experience that can rank both images and text documents together effectively.
Methods
The architecture of jina-reranker-m0 represents a significant departure from previous approaches. Built on Qwen2-VL-2B with 2.1 billion parameters, it shifts from the classic cross-encoder architecture to a decoder-only vision language model. The system reuses Qwen2-VL's pretrained vision encoder and projector, fine-tunes its language model with LoRA (Low-Rank Adaptation), and adds a post-trained MLP head that produces ranking logits measuring query-document relevance. This discriminative model handles a context of up to 32K tokens and supports images from 56×56 pixels up to 4K resolution. When processing images, the Vision Transformer (ViT) and projector condense each 2×2 block of adjacent tokens into a single visual token, while special tokens mark the boundaries of the visual tokens, enabling the language model to properly integrate and reason across both visual and textual elements.
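To make the scoring path concrete, here is a minimal conceptual sketch of how a decoder-only reranker of this kind can turn a query-document pair into a single relevance score: the backbone (standing in for Qwen2-VL) encodes the interleaved query and document tokens, and an MLP head maps a final hidden state to one logit. The class name, layer sizes, and the choice to score from the last token's representation are illustrative assumptions, not the model's actual implementation.

```python
# Conceptual sketch of a decoder-only reranker's scoring head (not the
# official jina-reranker-m0 code): the LM produces hidden states for the
# interleaved query/document sequence, and an MLP maps the last token's
# hidden state to a single ranking logit.
import torch
import torch.nn as nn

class RankingHead(nn.Module):
    """Hypothetical MLP head that turns a hidden state into a relevance logit."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),  # one logit per query-document pair
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: (batch, seq_len, hidden) from the decoder-only LM;
        # score each pair from the final token's representation.
        return self.mlp(last_hidden[:, -1, :]).squeeze(-1)

# Usage with dummy activations standing in for the Qwen2-VL backbone
# (1536 matches Qwen2-VL-2B's hidden size, used here for illustration):
head = RankingHead(hidden_size=1536)
dummy_hidden = torch.randn(4, 128, 1536)   # 4 query-document pairs
print(head(dummy_hidden).shape)            # torch.Size([4])
```

At inference, the logits for a query's candidate documents are simply sorted in descending order to produce the ranked list.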
Performance
jina-reranker-m0 achieves impressive results across multiple benchmarks. In text-to-text reranking, it scores 58.95 NDCG@10 on the BEIR benchmark, outperforming jina-embeddings-v3 (55.81) and bge-reranker-v2-m3 (56.51). For multilingual content, it achieves 66.75 NDCG@10 on the MIRACL benchmark covering 18 languages. On the MLDR benchmark for long documents, it scores 59.83 NDCG@10 across 13 languages. For code retrieval on the CoIR benchmark, it achieves 63.55 NDCG@10, significantly outperforming competitors. But the model truly shines in visual document retrieval: on the ViDoRe benchmark it scores an impressive 91.02 NDCG@5, and on Winoground, which tests visio-linguistic compositional reasoning, it reaches an average score of 43.92, demonstrating a stronger grasp of text-image relationships than competing models.
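All of these scores are normalized discounted cumulative gain at a rank cutoff. For readers who want the metric pinned down, the sketch below implements the common exponential-gain formulation of NDCG@k; the relevance grades in the example are made up purely for illustration.

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain over the top-k ranked results:
    # gain (2^rel - 1) discounted by log2 of the (1-based) rank + 1.
    return sum((2**rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of an ideal (descending-relevance) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Example: graded relevance of documents in the order a reranker returned them.
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10), 4))  # ~0.9488
```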
Best Practice
To maximize the potential of jina-reranker-m0, developers should consider several implementation strategies. The model is accessible via API, through cloud marketplaces (AWS, Azure, GCP), or locally via Hugging Face. When using the API, developers can pass text strings, base64-encoded images, or image URLs, and new users are eligible for 1 million free tokens. While the model performs exceptionally well on text-to-text, text-to-image, image-to-text, and text-to-mixed-modality tasks thanks to extensive training on those combinations, others (such as image-to-image) are supported only in a zero-shot manner, without task-specific training. For optimal results, note that the model accepts up to 10K input tokens, with up to 768 tokens per image. The decoder-only architecture also opens possibilities beyond simple reranking, including true mixed-modality reranking, listwise reranking, document deduplication, and ranking-score explainability via attention mechanisms, capabilities that were not achievable with previous encoder-only architectures.
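As a quick start, the sketch below sends one text document and one image URL to the hosted reranking endpoint. The endpoint path, request fields, and response shape follow Jina's public API documentation as of this writing, but treat them as assumptions and check the current docs before relying on them; the query, documents, and environment variable name are placeholders.

```python
# Sketch of a mixed-modality reranking call against Jina's hosted API.
# Endpoint and field names are assumptions based on the public docs;
# verify against https://jina.ai/reranker for the current schema.
import os
import requests

resp = requests.post(
    "https://api.jina.ai/v1/rerank",
    headers={"Authorization": f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        "model": "jina-reranker-m0",
        "query": "quarterly revenue breakdown by region",
        "documents": [
            {"text": "A plain-text passage describing regional sales figures."},
            {"image": "https://example.com/scanned-report-page.png"},  # image URL
        ],
        "top_n": 2,
    },
    timeout=60,
)
resp.raise_for_status()
for result in resp.json()["results"]:
    print(result["index"], result["relevance_score"])
```

Note how text documents and image documents sit side by side in the same candidate list, which is exactly the mixed-modality ranking scenario the unified decoder-only design was built for.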