Press release
April 08, 2025

jina-reranker-m0: Multilingual Multimodal Document Reranker

Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
Jina AI • 21 minutes read
jina-reranker-m0 - Search Foundation Models: multilingual multimodal reranker model for ranking visual documents
jinaai/jina-reranker-m0 · Hugging Face

Today we're releasing jina-reranker-m0, our new multilingual multimodal reranker model for ranking visual documents across multiple languages: it accepts a query alongside a collection of visually rich document images, including pages with text, figures, tables, infographics, and various layouts across multiple domains and over 29 languages. It outputs a ranked list of documents ordered by their relevance to the query. Compared to jina-reranker-v2-base-multilingual, jina-reranker-m0 also improves text reranking for multilingual content, long documents, and code searching tasks.

The performance of jina-reranker-m0 on the ViDoRe, MBEIR, and Winoground visual retrieval benchmarks showcases its capabilities across diverse multimodal retrieval tasks spanning multiple domains and languages. Each dot represents the performance score on one type/task of visual documents. The boxplots illustrate the distribution of these scores, with the highlighted numbers indicating the average (mean) performance. For complete benchmark results, please refer to the appendix of this post.
This boxplot displays the performance of jina-reranker-m0 across five text-only reranking benchmarks. Each benchmark may include multiple datasets, languages, or tasks, represented by individual dots inside the boxplot. The boxplot shows the distribution of these scores, with the highlighted number showing the average (mean) performance. While most benchmarks use NDCG@10 as their performance metric, MKQA uses recall@10 instead, as MKQA's annotation data doesn't support NDCG calculation (the official evaluation uses recall, which determines document relevance through heuristics). Complete benchmark results are available in the appendix of this post.

New Architecture

The architecture of jina-reranker-m0 is built upon Qwen2-VL-2B and consists of 2.4 billion parameters. This model efficiently ranks documents by evaluating both their visual and textual elements in relation to queries, using pairwise comparison.

Unlike jina-reranker-v2-base-multilingual, jina-reranker-m0 moves from the classic cross-encoder architecture to a decoder-only vision language model. It reuses Qwen2-VL's pretrained vision encoder and projector, fine-tunes the LLM with LoRA, and post-trains an MLP head that produces ranking logits measuring query-document relevance. The result is a discriminative model optimized for ranking tasks.
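To make the scoring mechanism concrete, here is a minimal, hypothetical sketch of such a ranking head: the backbone's hidden state at the last token of the "query + document" sequence is pooled and mapped through a small MLP to a single relevance logit. The class and variable names are illustrative assumptions, not the actual jina-reranker-m0 implementation.

import torch
import torch.nn as nn

class RankingHead(nn.Module):
    """Illustrative sketch: an MLP head on top of a decoder-only VLM
    that turns the last token's hidden state into one relevance logit."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, last_hidden_state: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # Pool the hidden state of the last non-padding token per sequence.
        last_idx = attention_mask.sum(dim=1) - 1
        batch = torch.arange(last_hidden_state.size(0))
        pooled = last_hidden_state[batch, last_idx]
        # One logit per (query, document) pair; higher means more relevant.
        return self.mlp(pooled).squeeze(-1)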

| | jina-reranker-m0 | jina-reranker-v2 |
|---|---|---|
| Architecture | Vision Language Model | Cross-Encoder |
| Base model | Qwen2-VL-2B | Jina-XLM-RoBERTa |
| Parameters | 2.4 B | 278 M |
| Max context length (query + document) | 10,240 | 8,192 |
| Max image patches (dynamic resolution) | 768 × 28 × 28 | ❌ |
| Multilingual support | ✅ | ✅ |
| Tasks supported | Text2Text, Text2Image, Image2Text, Text2Mixed | Text2Text |

This new architecture allows jina-reranker-m0 to handle up to 32K tokens, seamlessly combining both visual and textual inputs. The model supports images ranging from a minimum size of 56×56 pixels up to 4K resolution. When processing images, the ViT and projector condense adjacent 2×2 tokens into single visual tokens for LLM input. Special tokens such as <|vision_start|> and <|vision_end|> clearly mark visual token boundaries, enabling the language model to properly process visual information and perform sophisticated multimodal reasoning that integrates both visual and textual elements.
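As a rough back-of-the-envelope sketch of this budget, the snippet below estimates how many visual tokens an image occupies, assuming one token per 28×28-pixel area after the 2×2 merge and a 768-token cap per image (the figures quoted in the table above and in the training note below); the actual preprocessor's resizing policy may differ.

import math

def visual_token_estimate(width: int, height: int,
                          patch: int = 28, max_tokens: int = 768) -> int:
    # One visual token per 28x28-pixel area after the 2x2 merge
    # (illustrative numbers; the real preprocessor may resize differently).
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    return min(tokens, max_tokens)

print(visual_token_estimate(56, 56))      # minimum-size image -> 4 tokens
print(visual_token_estimate(3840, 2160))  # a 4K page -> capped at 768 tokens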

This architecture also effectively solves the modality gap problem that plagued earlier models like jina-clip-v1 and jina-clip-v2. Previously, images would cluster near other images while text would cluster near other text in the representation space, creating a disconnect. This meant that when your candidate documents contained both images and text, retrieving images using text queries was problematic. With jina-reranker-m0, you can now rank images and documents together without worrying about this gap, creating a truly unified multimodal search experience.

In multimodal retrieval systems, a "modality gap" refers to the difference in how the model scores text-to-text similarity versus text-to-image similarity. Looking at the left panel (jina-clip-v2), there is a clear separation between the two distributions: the text-to-text similarity distribution (red) peaks around 0.35, while the text-to-image similarity distribution (blue) peaks around 0.65-0.7. This wide separation indicates a large modality gap: the model scores text-to-text and text-to-image pairs in fundamentally different ranges, which makes it difficult to compare scores directly across modalities. In a system without a modality gap (e.g. jina-reranker-m0, shown on the right), we would expect the distributions to largely overlap, meaning that the model scores both types of pairs in similar ranges based purely on relevance, not on modality type.
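One simple way to quantify this gap is to score a sample of text-to-text and text-to-image pairs with the same model and measure how much the two score distributions overlap; full overlap means scores are comparable across modalities. The snippet below is a sketch with placeholder (randomly generated) scores, not measured data.

import numpy as np

# Placeholder score samples for illustration only (not measured data).
rng = np.random.default_rng(0)
text_to_text = rng.normal(loc=0.35, scale=0.08, size=1000)
text_to_image = rng.normal(loc=0.67, scale=0.08, size=1000)

def overlap_coefficient(a, b, bins=50):
    # Histogram overlap in [0, 1]; values near 1 mean no modality gap.
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    pa, _ = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    pb, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    width = (hi - lo) / bins
    return float(np.minimum(pa, pb).sum() * width)

print(f"score distribution overlap: {overlap_coefficient(text_to_text, text_to_image):.2f}")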

It's worth noting that our training was limited to a maximum of 10K input tokens, with up to 768 tokens per image (between <|vision_start|> and <|vision_end|> markers). Additionally, we didn't specifically train the model for image-to-image, image-to-multimodal, or text-to-multimodal reranking tasks. In this context, "multimodal" refers to a single document containing both image and text tokens in the input. Looking at all possible combinations of image and text tokens in both queries and documents, we can summarize the full range of tasks supported by jina-reranker-m0 in the table below.

jina-reranker-m0 supports a wide range of query and document input combinations for reranking purposes. It achieves state-of-the-art performance in text-to-text, text-to-image, image-to-text, and text-to-mixed-unimodal tasks, thanks to extensive training. The model also handles other input combinations in a zero-shot manner - the architecture accommodates these token combinations, though we haven't specifically trained for these tasks.

In our testing, we found some evidence suggesting the model can extrapolate to these untrained ranking tasks, but any effectiveness in these areas should be viewed as a result of the model's zero-shot transferability or unintended training side effects. We haven't conducted serious evaluations of the model's performance on these tasks, and plan to explore these capabilities more thoroughly in future research.

Getting Started

For a quick vibe check, try our text-to-image rerank demo in the Search Foundation toolbox. We've prepared a collection of document images from our website, and you can also add your own image URLs. Simply type your query and press enter to see the ranked results. You can treat it either like OCR or like content-based image retrieval, and feel free to try queries in languages other than English.

The demo is available at https://jina.ai/api-dashboard/m0-image-rerank. Please note that using it consumes tokens from your primary API key. The demo may also feel a bit slow, since the server has to download every image from its URL and no image caching is implemented.

Via API

The code below shows how to calculate relevance scores between the query "small language model data extraction" and a collection of images and text documents. You can pass a text string, a base64-encoded image, or an image URL. New users can get a Jina API key with 1 million free tokens. While our API doesn't support using images as queries, you can use images as queries when accessing the model through the Hugging Face Transformers library.

curl -X POST \
  https://api.jina.ai/v1/rerank \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer JINA_API_KEY" \
  -d '{
  "model": "jina-reranker-m0",
  "query": "small language model data extraction",
  "documents": [
    {
      "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/handelsblatt-preview.png"
    },
    {
      "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png"
    },
    {
      "image": "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/wired-preview.png"
    },
    {
      "text": "We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The models effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements."
    },
    {
      "image": "https://jina.ai/blog-banner/using-deepseek-r1-reasoning-model-in-deepsearch.webp"
    },
    {
      "text": "数据提取么?为什么不用正则啊,你用正则不就全解决了么?"
    },
    {
      "text": "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold."
    },
    {
      "text": "Die wichtigsten Beiträge unserer Arbeit sind zweifach: Erstens führen wir eine neuartige dreistufige Datensynthese-Pipeline namens Draft-Refine-Critique ein, die durch iterative Verfeinerung hochwertige Trainingsdaten generiert; und zweitens schlagen wir eine umfassende Trainingsstrategie vor, die kontinuierliches Vortraining zur Längenerweiterung, überwachtes Feintuning mit spezialisierten Kontrollpunkten, direkte Präferenzoptimierung (DPO) und iteratives Self-Play-Tuning kombiniert. Um die weitere Forschung und Anwendung der strukturierten Inhaltsextraktion zu erleichtern, ist das Modell auf Hugging Face öffentlich verfügbar."
    }
  ],
  "return_documents": false
}'

The response is shown below, where the top result, index=1, corresponds to our ReaderLM-v2 paper screenshot.

{"model":"jina-reranker-m0","usage":{"total_tokens":2829},"results":[{"index":1,"relevance_score":0.9587112551898949},{"index":3,"relevance_score":0.9337408271911014},{"index":7,"relevance_score":0.8922925217195924},{"index":2,"relevance_score":0.8891905997562045},{"index":0,"relevance_score":0.8827516945848907},{"index":4,"relevance_score":0.8701035914834407},{"index":6,"relevance_score":0.8676828987527296},{"index":5,"relevance_score":0.8455347349164652}]}

Via CSP Marketplaces

jina-reranker-m0 will soon be available directly on AWS, Azure, and GCP at the prices listed there.

AWS Marketplace: Jina Reranker m0
Microsoft Azure Marketplace
Google Cloud console

Via Hugging Face

You can also use the model locally from our Hugging Face page. We've prepared a Google Colab notebook that demonstrates how it works. Compared to our web API, using the model locally offers greater flexibility, such as the ability to use images as queries and work with multimodal documents.
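As a rough sketch of what local usage looks like, the snippet below loads the checkpoint with trust_remote_code and scores a few query-document pairs. The compute_score helper and its argument format are assumptions here; the Hugging Face model card and the Colab notebook below are the authoritative reference for the exact interface.

# Sketch only: the scoring helper below is an assumed interface;
# follow the Hugging Face model card / Colab notebook for the exact API.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "jinaai/jina-reranker-m0",
    torch_dtype=torch.float16,
    trust_remote_code=True,  # the ranking code ships with the checkpoint
)
model.eval()

query = "small language model data extraction"
documents = [
    "https://raw.githubusercontent.com/jina-ai/multimodal-reranker-test/main/paper-11.png",
    "During the California Gold Rush, some merchants made more money selling supplies to miners than the miners made finding gold.",
]
# Hypothetical convenience method exposed by the remote code:
scores = model.compute_score([[query, doc] for doc in documents])
print(scores)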

Google Colab

Conclusion

jina-reranker-m0 is our first attempt to unify textual and visual modalities in a single decoder-only model. This new architecture incorporates lessons learned from our previous encoder-only retrieval models, including jina-clip-v2, jina-embeddings-v3, jina-reranker-v2-base-multilingual and jina-embeddings-v2-base-code.

The new model not only unlocks capabilities for multimodal retrieval tasks, such as text-to-image reranking and visual document reranking, but also demonstrates improved performance compared to jina-reranker-v2-base-multilingual on text-to-text and text-to-code reranking tasks. We designate this new model series as the "m-series" to highlight its multimodal nature.

When comparing jina-reranker-m0 with jina-reranker-v2-base-multilingual, our goal for the m-series is to achieve multimodality while improving performance on text-only tasks at a level comparable to specialized text-only models. Some might question the value of using an 8x larger model if the performance improvement on text-only tasks appears marginal. While it's true for the moment that m0 may not provide substantial added value over v2 for text-only applications, the decoder-only architecture opens up many new possibilities that weren't achievable with encoder-only architectures, including:

  • True mixed-modality reranking
  • Listwise reranking and document deduplication (see the sketch after this list)
  • Ranking score explainability via attention mechanisms
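As a toy illustration of the deduplication idea, the sketch below greedily drops documents that score too highly against an already-kept document, using any pairwise relevance scorer (for example, a reranker call with one document used as the query). The threshold and the greedy strategy are assumptions, not a feature of the current API.

def deduplicate(docs, score_fn, threshold=0.95):
    """Greedy dedup sketch: keep a document only if it does not score
    above `threshold` against any document already kept.
    `score_fn(a, b)` is any pairwise relevance scorer."""
    kept = []
    for doc in docs:
        if all(score_fn(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept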

Our future work will focus on further upgrading the text-only reranker and fully leveraging the new features enabled by this multimodal architecture to deliver better and broader search.

Appendix: Evaluation


Full evaluations can be found in this Google Spreadsheet.

BEIR (Text2Text, English-only)

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
arXiv.org · Nandan Thakur

BEIR is a heterogeneous benchmark for information retrieval, designed to evaluate the versatility and robustness of IR models. It contains a diverse set of datasets from various domains and focuses on zero-shot evaluation. Standardized evaluation metrics such as NDCG, Recall@K, and MRR are used.
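For reference, NDCG@10, the metric reported in most of the tables below, can be computed as in the generic sketch here; this is a textbook implementation, not the official BEIR/pytrec_eval code.

import math

def dcg_at_k(relevances, k=10):
    # relevances: graded relevance labels of the ranked documents, in rank order
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# Example: relevance labels of a reranker's output; the ideal order would be [3, 2, 1, ...]
print(round(ndcg_at_k([3, 2, 0, 1, 0, 0, 0, 0, 0, 0]), 4))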

| Model | AVG (NDCG@10) | TREC-COVID | NFCorpus | NQ | HotpotQA | FiQA | ArguAna | Touche-2020 | DBPedia | SCIDOCS | FEVER | Climate-FEVER | SciFact | Quora |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-reranker-m0 | 58.95 | 84.17 | 41.03 | 72.25 | 76.99 | 51.62 | 40.69 | 31.79 | 49.34 | 22.91 | 91.14 | 36.42 | 79.94 | 88.01 |
| jina-embeddings-v3 (1024 tokens) | 55.81 | 77.81 | 36.65 | 64.31 | 64.63 | 47.47 | 54.31 | 26.55 | 41.07 | 19.91 | 89.00 | 42.33 | 72.4 | 89.06 |
| bge-reranker-v2-m3 | 56.51 | 82.19 | 34.33 | 69.52 | 77.89 | 45.45 | 36.21 | 33.12 | 46.72 | 17.79 | 91.03 | 38.69 | 72.64 | 89.10 |
| jina-reranker-v2-multilingual | 57.06 | 80.53 | 37.17 | 67.39 | 76.17 | 46.48 | 39.28 | 32.35 | 47.81 | 20.03 | 93.02 | 37.17 | 76.50 | 87.83 |

MIRACL (Text2Text, Multilingual, 18 languages)

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world. These languages have diverse typologies, originate from many different language families, and are associated with varying amounts of available resources -- including what researchers typically characterize as high-resource as well as low-resource languages. Our dataset is designed to support the creation and evaluation of models for monolingual retrieval, where the queries and the corpora are in the same language. In total, we have gathered over 700k high-quality relevance judgments for around 77k queries over Wikipedia in these 18 languages, where all assessments have been performed by native speakers hired by our team. Our goal is to spur research that will improve retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have been traditionally underserved. This overview paper describes the dataset and baselines that we share with the community. The MIRACL website is live at http://miracl.ai/.
arXiv.org · Xinyu Zhang

MIRACL is a large-scale multilingual dataset for ad hoc information retrieval across 18 languages. It encompasses over three billion native speakers and features thorough human annotations. The focus is on monolingual retrieval tasks.

| Model | AVG (NDCG@10) | ar | bn | en | es | fa | fi | fr | hi | id | ja | ko | ru | sw | te | th | zh | de | yo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-reranker-m0 | 66.75 | 79.78 | 78.01 | 59.21 | 53.56 | 58.80 | 78.00 | 56.66 | 62.83 | 54.92 | 66.51 | 72.86 | 67.26 | 59.04 | 70.19 | 80.37 | 64.51 | 58.50 | 80.44 |
| jina-embeddings-v3 (8192 tokens) | 58.90 | 71.53 | 69.86 | 48.37 | 46.91 | 54.13 | 71.15 | 50.90 | 55.05 | 47.83 | 56.46 | 64.76 | 55.63 | 54.07 | 70.48 | 73.56 | 55.29 | 49.18 | 65.01 |
| bge-reranker-v2-m3 | 69.32 | 80.51 | 81.85 | 57.67 | 57.64 | 61.92 | 80.38 | 59.60 | 67.66 | 58.86 | 67.37 | 75.14 | 67.61 | 68.92 | 76.69 | 82.29 | 64.46 | 58.32 | 80.85 |
| jina-reranker-v2-multilingual | 63.65 | 72.50 | 79.42 | 46.66 | 51.54 | 57.81 | 73.05 | 50.90 | 60.94 | 56.66 | 59.15 | 72.60 | 53.43 | 66.47 | 74.62 | 77.75 | 62.49 | 53.06 | 76.69 |

MLDR (Text2Text, Multilingual Long Documents, 13 languages)

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
arXiv.org · Jianlv Chen

MLDR is a multilingual dataset specifically designed for long-document retrieval, covering 13 languages. It utilizes GPT-3.5 to generate questions for the documents. The dataset is built on Wikipedia, Wudao and mC4.

| Model | AVG (NDCG@10) | ar | de | en | es | fr | hi | it | ja | ko | pt | ru | th | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-reranker-m0 | 59.83 | 55.86 | 51.25 | 54.67 | 87.63 | 82.59 | 32.76 | 73.25 | 58.93 | 55.73 | 86.08 | 66.73 | 39.17 | 33.14 |
| jina-embeddings-v3 (8192 tokens) | 39.71 | 28.44 | 31.57 | 29.07 | 62.08 | 59.79 | 25.47 | 53.72 | 38.36 | 32.37 | 63.26 | 49.65 | 25.15 | 17.26 |
| bge-reranker-v2-m3 | 53.53 | 49.19 | 45.39 | 43.92 | 74.57 | 68.67 | 44.75 | 62.79 | 49.27 | 48.24 | 76.45 | 62.84 | 38.82 | 31.02 |
| jina-reranker-v2-multilingual | 59.50 | 51.96 | 50.13 | 46.85 | 86.34 | 82.25 | 49.50 | 69.00 | 59.07 | 52.19 | 85.26 | 68.06 | 38.73 | 34.15 |

MKQA (Text2Text, Multilingual Question-Answering, 24 languages, 3 variants for Chinese)

MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering
Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to-date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages
arXiv.org · Shayne Longpre

MKQA is an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages. The question-answer pairs are sampled from Google Natural Questions.

| Model | AVG (recall@10) | ar | da | de | es | en | fi | fr | he | hu | it | ja | km | ko | ms | nl | no | pl | pt | ru | sv | th | tr | vi | zh_cn | zh_hk | zh_tw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-reranker-m0 | 68.19 | 63.88 | 70.57 | 70.52 | 71.26 | 73.47 | 64.10 | 71.11 | 63.68 | 63.23 | 70.30 | 69.13 | 50.43 | 64.30 | 70.78 | 71.73 | 70.25 | 69.72 | 70.57 | 70.78 | 70.69 | 69.80 | 67.90 | 69.68 | 69.12 | 68.23 | 67.79 |
| jina-embeddings-v3 (8192 tokens) | 65.63 | 59.00 | 69.12 | 68.27 | 68.15 | 71.14 | 65.66 | 68.30 | 59.51 | 63.23 | 68.30 | 64.36 | 56.13 | 58.98 | 68.30 | 69.53 | 68.65 | 67.26 | 67.93 | 67.06 | 68.68 | 66.32 | 66.97 | 66.87 | 63.38 | 63.59 | 61.55 |
| bge-reranker-v2-m3 | 67.88 | 63.09 | 70.15 | 68.91 | 68.92 | 73.00 | 68.71 | 68.71 | 70.27 | 64.00 | 68.15 | 68.47 | 60.43 | 63.95 | 68.80 | 70.77 | 69.10 | 67.44 | 67.40 | 69.77 | 70.03 | 69.68 | 66.04 | 68.29 | 67.84 | 66.70 | 66.34 |
| jina-reranker-v2-multilingual | 67.90 | 63.88 | 70.31 | 70.09 | 70.51 | 73.09 | 67.50 | 70.38 | 63.00 | 64.59 | 69.90 | 67.34 | 57.79 | 62.14 | 70.36 | 71.58 | 69.51 | 68.61 | 70.13 | 70.07 | 70.15 | 68.80 | 68.02 | 69.39 | 67.23 | 65.77 | 65.37 |

CoIR (Text2Text, Code Information Retrieval)

CoIR: A Comprehensive Benchmark for Code Information Retrieval Models
Despite the substantial success of Information Retrieval (IR) in various NLP tasks, most IR systems predominantly handle queries and corpora in natural language, neglecting the domain of code retrieval. Code retrieval is critically important yet remains under-explored, with existing methods and benchmarks inadequately representing the diversity of code in various domains and tasks. Addressing this gap, we present COIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. COIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We first discuss the construction of COIR and its diverse dataset composition. Further, we evaluate nine widely used retrieval models using COIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems. To facilitate easy adoption and integration within existing research workflows, COIR has been developed as a user-friendly Python framework, readily installable via pip. It shares same data schema as other popular benchmarks like MTEB and BEIR, enabling seamless cross-benchmark evaluations. Through COIR, we aim to invigorate research in the code retrieval domain, providing a versatile benchmarking tool that encourages further development and exploration of code retrieval systems https://github.com/CoIR-team/coir.
arXiv.org · Xiangyang Li

CoIR is a comprehensive benchmark designed to evaluate models’ abilities in code retrieval. It includes 10 curated code datasets covering 8 retrieval tasks across 7 diverse domains. A Python framework is provided for this benchmark.

Column groups follow the CoIR task taxonomy: Text-to-Code (Apps, CosQA, SQL), Code-to-Text (CSN), Code-to-Code (CSN-CCR, CodeTransOcean), Hybrid Code (StackOverFlow, CodeFeedBack).

| Model | Avg (NDCG@10) | Apps | CosQA | SQL | CSN AVG | CSN python | CSN javascript | CSN go | CSN ruby | CSN java | CSN php | CSN-CCR AVG | CSN-CCR python | CSN-CCR javascript | CSN-CCR go | CSN-CCR ruby | CSN-CCR java | CSN-CCR php | CodeTransOcean-Contest | CodeTransOcean-DL | StackOverFlow | CodeFeedBack-MT | CodeFeedBack-ST |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-reranker-m0 | 63.55 | 26.21 | 37.75 | 57.92 | 80.76 | 98.37 | 71.16 | 86.14 | 72.74 | 79.02 | 77.14 | 74.57 | 81.66 | 77.92 | 68.71 | 75.44 | 77.54 | 66.13 | 79.79 | 31.89 | 90.41 | 72.25 | 83.95 |
| jina-embeddings-v2-base-code (top 100) | 56.90 | 16.34 | 41.72 | 49.79 | 83.95 | 94.71 | 76.35 | 87.39 | 78.23 | 82.69 | 84.35 | 59.65 | 68.23 | 62.31 | 49.15 | 65.40 | 63.89 | 48.92 | 79.20 | 30.35 | 89.42 | 49.62 | 68.93 |
| bge-reranker-v2-m3 | 35.97 | 8.33 | 30.06 | 50.63 | 49.26 | 67.62 | 39.55 | 58.11 | 41.37 | 44.77 | 44.13 | 40.81 | 42.57 | 42.75 | 38.04 | 38.04 | 41.73 | 41.73 | 34.93 | 5.09 | 60.12 | 16.44 | 64.05 |
| jina-reranker-v2-multilingual | 56.14 | 21.90 | 37.26 | 53.56 | 78.88 | 97.83 | 67.43 | 84.64 | 68.93 | 75.73 | 78.71 | 63.59 | 72.62 | 67.80 | 55.07 | 67.25 | 64.25 | 54.54 | 73.67 | 25.74 | 91.24 | 42.03 | 73.59 |

ViDoRe (Text2Image, Visual Document Retrieval Benchmark)

ColPali: Efficient Document Retrieval with Vision Language Models
Documents are visually rich structures that convey information through text, but also figures, page layouts, tables, or even fonts. Since modern retrieval systems mainly rely on the textual information they extract from document pages to index documents -often through lengthy and brittle processes-, they struggle to exploit key visual cues efficiently. This limits their capabilities in many practical document retrieval applications such as Retrieval Augmented Generation (RAG). To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings. The inherent complexity and performance shortcomings of modern systems motivate a new concept; doing document retrieval by directly embedding the images of the document pages. We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically simpler, faster and end-to-end trainable. We release models, data, code and benchmarks under open licenses at https://hf.co/vidore.
arXiv.org · Manuel Faysse

ViDoRe is a benchmark designed to evaluate retrieval systems on their capacity to match queries to relevant documents using visual features. It covers various page-level retrieval tasks across multiple domains and languages, with a focus on the visual elements of documents.

| Model | AVG (NDCG@5) | TAT-DQA | Shift Project | Artificial Intelligence | Government Reports | ArxivQA | DocVQA | Healthcare Industry | InfoVQA | Energy | TabFQuad |
|---|---|---|---|---|---|---|---|---|---|---|---|
| jina-reranker-m0 | 91.02 | 81.83 | 93.22 | 99.63 | 97.59 | 89.82 | 62.58 | 99.26 | 92.88 | 96.06 | 97.32 |
| MrLight/dse-qwen2-2b-mrl-v1 | 84.48 | 66.64 | 79.39 | 96.45 | 95.30 | 84.53 | 55.47 | 96.85 | 86.39 | 91.80 | 92.03 |
| MonoQwen2-VL-v0.1 | 87.64 | 79.50 | 76.38 | 98.39 | 93.63 | 89.50 | 57.47 | 98.39 | 92.12 | 95.29 | 95.75 |

M-BEIR (Text2Image, Image2Text, Multimodal BEnchmark for Instructed Retrieval)

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers
Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To approach such different information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal-IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are keys to UniIR’s generalization ability. Additionally, we construct the M-BEIR, a multimodal retrieval benchmark with comprehensive results, to standardize the evaluation of universal multimodal information retrieval.
arXiv.org · Cong Wei

M-BEIR is a comprehensive large-scale retrieval benchmark designed to train and evaluate multimodal retrieval models. It comprises eight multimodal retrieval tasks and ten datasets from a variety of domains and sources. The benchmark focuses on instruction-following retrieval.

| Model | t2i VisualNews (Recall@5) | t2i MSCOCO (Recall@5) | t2i Fashion200K (Recall@10) | i2t VisualNews (Recall@5) | i2t MSCOCO (Recall@5) | i2t Fashion200K (Recall@10) |
|---|---|---|---|---|---|---|
| jina-reranker-m0 | 23.89 | 72.19 | 9.79 | 17.61 | 41.21 | 11.56 |
| jinaai/jina-clip-v2 | 15.42 | 52.28 | 7.03 | 11.63 | 28.80 | 8.78 |
| MonoQwen2-VL-v0.1 | 22.74 | 71.29 | 10.00 | 15.08 | 42.24 | 11.25 |

Winoground (Text2Text, Text2Image)

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
We present a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning, which we call Winoground. Given two images and two captions, the goal is to match them correctly - but crucially, both captions contain a completely identical set of words, only in a different order. The dataset was carefully hand-curated by expert annotators and is labeled with a rich set of fine-grained tags to assist in analyzing model performance. We probe a diverse range of state-of-the-art vision and language models and find that, surprisingly, none of them do much better than chance. Evidently, these models are not as skilled at visio-linguistic compositional reasoning as we might have hoped. We perform an extensive analysis to obtain insights into how future work might try to mitigate these models’ shortcomings. We aim for Winoground to serve as a useful evaluation set for advancing the state of the art and driving further progress in the field. The dataset is available at https://huggingface.co/datasets/facebook/winoground.
arXiv.org · Tristan Thrush

Winoground is a novel task and dataset for evaluating the ability of vision and language models to conduct visio-linguistic compositional reasoning. It uses twin captions with identical word content and employs contrastive image-caption pairs. The focus is on compositional reasoning.

| Model | Text | Image | Group | Avg |
|---|---|---|---|---|
| jina-reranker-m0 | 57.00 | 40.75 | 34.00 | 43.92 |
| MrLight/dse-qwen2-2b-mrl-v1 | 7.50 | 9.25 | 1.75 | 6.17 |
| MonoQwen2-VL-v0.1 | 52.00 | 36.25 | 31.50 | 39.92 |

Winoground evaluates vision-language models using three key metrics: Text Score, Image Score, and Group Score. The Text Score measures if a model correctly matches captions to images, while the Image Score assesses if it selects the right image for a caption. The Group Score, the most rigorous metric, requires all caption-image relationships to be correctly identified. Scores are percentages representing accuracy rates, with higher values indicating better reasoning abilities.
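For clarity, the three metrics can be computed per example from the four pairwise scores as in the short sketch below, following the canonical Winoground definitions; s[c][i] denotes the score of caption c against image i, and the example scores are made up.

def winoground_example_metrics(s):
    """s[c][i]: relevance score of caption c (0 or 1) for image i (0 or 1)."""
    text = s[0][0] > s[1][0] and s[1][1] > s[0][1]   # each image ranks its own caption first
    image = s[0][0] > s[0][1] and s[1][1] > s[1][0]  # each caption ranks its own image first
    group = text and image                           # both conditions must hold
    return text, image, group

# Made-up scores for one example:
print(winoground_example_metrics([[0.92, 0.40], [0.35, 0.88]]))  # (True, True, True)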
