Press release
July 25, 2025

JinaVDR: New Visual Document Retrieval Benchmark with 95 Tasks in 20 Languages

JinaVDR is a new benchmark spanning 95 tasks across 20 languages for visual document retrieval, soon on MTEB.
Maximilian Werk, Alex C-G • 8 minutes read
GitHub: jina-ai/jina-vdr, a multilingual, multi-domain benchmark for visual document retrieval
Hugging Face: JinaVDR (Visual Document Retrieval), a jinaai Collection (max. ~1,000 images and OCR text included)

We're releasing JinaVDR (Visual Document Retrieval), a new benchmark for evaluating how well models retrieve visually complex documents. JinaVDR encompasses multilingual documents with intricate layouts—combining graphs, charts, tables, text, and images alongside scanned copies and screenshots. The benchmark pairs these diverse visual documents with targeted text queries, enabling comprehensive evaluation of retrieval performance across real-world document complexity and broader domain coverage.

| Benchmark | Task focus | Languages | Number of tasks |
|-----------|------------|-----------|-----------------|
| JinaVDR | Visually rich documents | 20 languages | 95 |
| MIEB | Mostly natural images | 38 languages | 130 |
| ViDoRe v1 | Visually rich documents | English | 5 |
| ViDoRe v2 | Visually rich documents | English, French, Spanish, German | 4 |
JinaVDR statistics, showing query/document languages, domains and document formats

JinaVDR spans diverse languages, domains, and document formats to reflect real-world retrieval scenarios. While English remains predominant in both queries and documents, the benchmark incorporates over a dozen additional languages, providing significantly broader multilingual coverage. The domains encompass historic documents, software documentation, medical records, legal texts, and scientific papers, capturing varied professional use cases. Document formats range from web pages and PDFs to scanned materials, presentation slides, and standalone images. Many datasets intentionally mix languages and formats, creating realistic conditions that challenge models to handle the complexity they encounter in practical applications.

How We Build JinaVDR

The JinaVDR benchmark provides an evaluation framework spanning 95 tasks across 20 languages, including domain-diverse and layout-rich documents like charts, maps, traditional scanned documents, Markdown files, and complex tables. It evaluates models through both visual question answering (for example, “How many civil lawsuits were dismissed at the Valladolid audience in 1855?”) and keyword querying (for example, “growth of the LED market across different regions”), giving a clearer assessment of retrieval capabilities on the different document types found in the real world.

We used four techniques to construct JinaVDR with a focus on data diversity and task authenticity:

First, we repurposed existing benchmarks by converting OCR datasets into retrieval tasks using rule-based query templates (such as transforming MPMQA data) and reformatting question-answering datasets into retrieval scenarios:

Sample MPMQA document and query from JinaVDR benchmark
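As a concrete (made-up) illustration of that reformatting step, a VQA-style record can be reduced to a retrieval pair by keeping only the question as the query and the page image as the positive document. The field names below are illustrative and do not reflect the actual MPMQA schema:

```python
# Minimal, made-up illustration of turning a VQA-style record into a retrieval pair.
# Field names are illustrative, not the actual MPMQA schema.
vqa_record = {
    "question": "How do I connect my watch to my smartphone?",
    "page_image": "manual_page_12.png",
    "answer": "Open the companion app and follow the pairing instructions.",
}

retrieval_pair = {
    "query": vqa_record["question"],          # the answer itself is not needed for retrieval
    "positive_doc": vqa_record["page_image"],  # the rendered page becomes the target document
}
print(retrieval_pair)
```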

Second, we manually annotated existing PDF datasets, including StanfordSlides, TextbookQA, and ShanghaiMasterPlan, to create high-quality retrieval pairs:

Sample StanfordSlides document and query from JinaVDR benchmark

Our third approach involved synthetic query and/or document generation, where we used existing document collections from sources like Europeana to create contextually relevant queries with Qwen2-VL-7B-Instruct, along with EasyOCR text descriptions:

Sample Europeana document and query from JinaVDR benchmark, including English query translation for reference
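In rough terms, that generation step looks something like the sketch below: extract OCR text from the scanned page with EasyOCR, then ask Qwen2-VL-7B-Instruct to write a retrieval query for the page. The prompt wording and loading details are illustrative, not our exact pipeline:

```python
# Rough sketch of the synthetic-query step (illustrative, not the exact JinaVDR pipeline):
# OCR a scanned page with EasyOCR, then prompt Qwen2-VL-7B-Instruct for a retrieval query.
import easyocr
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

image_path = "europeana_page.png"  # any scanned document page
page = Image.open(image_path)
ocr_text = " ".join(text for _, text, _ in easyocr.Reader(["de"]).readtext(image_path))

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": page},
        {"type": "text", "text": (
            f"OCR text of the page: {ocr_text}\n"
            "Write one natural-language question a user might type to find this document."
        )},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[page], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
query = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
print(query)
```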

We also rendered tabular datasets into visual tables and generated corresponding queries through templates derived from the original text data, as demonstrated in our AirBnBRetrieval task.
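A simplified sketch of that rendering step is shown below: a tabular record is drawn as a table image with matplotlib, and a query is filled from a template over the same row. The column names and template here are illustrative, not the actual AirBnBRetrieval schema:

```python
# Illustrative sketch: turn a tabular record into a (query, page image) pair.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "neighbourhood": ["Kreuzberg", "Mitte"],
    "room_type": ["Private room", "Entire home"],
    "price_eur": [45, 120],
})

# Render the table to an image; this becomes the "document" side of the pair.
fig, ax = plt.subplots(figsize=(6, 1.5))
ax.axis("off")
ax.table(cellText=df.values, colLabels=df.columns, loc="center")
fig.savefig("listing_table.png", dpi=200, bbox_inches="tight")

# Fill a query template from the same data; this becomes the "query" side.
row = df.iloc[0]
query = f"{row.room_type.lower()} listings in {row.neighbourhood} under {row.price_eur + 10} EUR"
print(query)  # e.g. "private room listings in Kreuzberg under 55 EUR"
```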

Finally, we repurposed existing crawled datasets with article-chart pairs, where we used text snippets from the articles as queries and the corresponding charts as target documents, as shown in our OWIDRetrieval dataset:

Sample OWIDRetrieval document and query from JinaVDR benchmark

This multi-faceted approach gives us comprehensive coverage across document types, languages, and retrieval scenarios.

Existing Benchmarks

Developing truly multimodal models (that can handle visually complex documents) requires benchmarks that go beyond traditional text-only evaluation methods. Frameworks like MTEB (Massive Text Embedding Benchmark) may work well for evaluating text retrieval across different domains and languages, but they’re not built for searching through documents where accurate retrieval depends on visual layout, charts, tables, and formatting. This is where visual document retrieval benchmarks (like the ViDoRe series) and image retrieval benchmarks (like MIEB, the Massive Image Embedding Benchmark) come in.

The ColPali paper introduced ViDoRe v1, which combines five English-language datasets, both academic and synthetic. The benchmark focuses on single-page documents that work well with optical character recognition (OCR), covers narrow domains like scientific papers and healthcare, and uses extractive queries where search terms often appear directly in target documents.

Samples from ViDoRe v1 benchmark dataset

After models like ColPali hit a score of 90% nDCG@5 on ViDoRe v1, a new benchmark was needed. ViDoRe v2 improved upon v1 by supporting longer and cross-document queries, blind contextual querying, and more languages (French, German, and Spanish on top of English). Both benchmarks still have limited language diversity and narrow domain coverage, leaving gaps for evaluating new retrieval systems.

Samples from ViDoRe v2 benchmark dataset

MIEB takes a different approach by focusing on visual embeddings across 130 tasks, including tasks beyond retrieval such as clustering and visual STS. However, it mostly evaluates images without much text content, rather than visually rich documents. While the benchmark is excellent at testing visual understanding capabilities, it doesn't cover the case where retrieval depends on both visual layout and textual content.

Samples from MIEB benchmark

Our aim with the JinaVDR (Visual Document Retrieval) benchmark is to expand upon the work of these prior benchmarks by incorporating visually rich multilingual documents with complex layouts like graphs, charts, and tables (mixed with text and images), as well as adding real-world queries and questions.

Evaluating Embeddings on JinaVDR

💡 You can run the benchmark for yourself with the code on our GitHub repo.

Our benchmarking results show that many recent embedding models struggle with JinaVDR’s wide range of visual document tasks, while OCR-based baselines and older models show even weaker results, especially on non-English and structured-document datasets. We included BM25 with OCR for all datasets where simple text extraction made such retrieval feasible.
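To give a sense of what such a baseline looks like, here is a rough sketch of a BM25-over-OCR retriever using EasyOCR for text extraction and the rank_bm25 package for scoring. This is illustrative of the approach only, not the exact baseline implementation behind the numbers below:

```python
# Illustrative BM25-over-OCR baseline: OCR each page, build a BM25 index, score a text query.
import easyocr
from rank_bm25 import BM25Okapi

reader = easyocr.Reader(["en"])
doc_images = ["page_001.png", "page_002.png", "page_003.png"]  # any document page images

# OCR each page and tokenize the extracted text into a BM25 corpus.
corpus = [" ".join(t for _, t, _ in reader.readtext(p)).lower().split() for p in doc_images]
bm25 = BM25Okapi(corpus)

query = "growth of the LED market across different regions"
scores = bm25.get_scores(query.lower().split())
ranking = sorted(zip(doc_images, scores), key=lambda x: x[1], reverse=True)
print(ranking)
```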

An exception to this is jina-embeddings-v4. Our results suggest that its multimodal embedding approach handles complex and multilingual document retrieval better than earlier generation models or traditional OCR-based pipelines. The model’s multi-vector capability provides the best performance because it avoids the compression limitations of single-vector approaches — while single vectors have to cram an entire page’s content into one representation (making it difficult to capture specific details), the multi-vector approach maintains the granular information needed for precise retrieval of similar documents.
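To see why this matters, here is a toy numpy comparison of the two scoring schemes. This illustrates the general idea of single-vector cosine scoring versus ColBERT-style MaxSim late interaction; it is not jina-embeddings-v4's actual scoring code:

```python
# Toy comparison: one pooled vector per side vs. ColBERT-style MaxSim over many vectors.
import numpy as np

rng = np.random.default_rng(0)
q_tokens = rng.standard_normal((8, 128))      # 8 query token embeddings
d_patches = rng.standard_normal((200, 128))   # 200 document patch embeddings

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

q, d = l2_normalize(q_tokens), l2_normalize(d_patches)

# Single-vector scoring: mean-pool each side, then one cosine similarity.
# Fine-grained details get averaged into a single representation.
single_score = float(l2_normalize(q.mean(axis=0)) @ l2_normalize(d.mean(axis=0)))

# Multi-vector (late interaction) scoring: each query token is matched against its
# best document patch, so token-level detail still influences the final score.
multi_score = float((q @ d.T).max(axis=1).sum())

print(f"single-vector cosine: {single_score:.3f}, multi-vector MaxSim: {multi_score:.3f}")
```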

Figure 9: Model performance on JinaVDR benchmark, averaged over all tasks
| Model | Average | medical-prescriptions | DonutVQA | TableVQA | europeana-de-news | europeana-es-news | europeana-it-scans | europeana-nl-legal | hindi-gov-vqa | jdocqa_jp | wikimedia-commons-documents (ar) | github-readme-retrieval-ml-filtered (ru) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BM25 + OCR | 26.67% | 38.18% | 19.39% | 35.64% | 11.26% | 51.99% | 39.11% | 34.97% | 1.83% | 1.64% | 19.60% | 39.78% |
| jina-embeddings-v3 + OCR | 27.49% | 37.25% | 2.60% | 34.24% | 12.05% | 44.03% | 38.69% | 29.07% | 7.52% | 7.79% | 38.06% | 51.07% |
| jina-clip-v2 | 17.79% | 15.66% | 1.63% | 21.06% | 11.19% | 13.14% | 16.23% | 9.79% | 5.02% | 19.91% | 45.29% | 36.80% |
| colpali-v1.2 | 46.44% | 83.91% | 32.53% | 54.66% | 34.64% | 44.74% | 54.32% | 30.89% | 13.04% | 39.45% | 41.96% | 80.67% |
| colqwen2-v0.1 | 58.26% | 77.72% | 46.34% | 57.52% | 53.42% | 74.28% | 71.23% | 46.13% | 20.53% | 74.38% | 36.94% | 82.39% |
| MrLight/dse-qwen2-2b-mrl-v1 | 47.95% | 38.22% | 25.31% | 57.39% | 44.75% | 60.58% | 53.92% | 29.50% | 9.80% | 66.73% | 62.47% | 78.77% |
| jina-embeddings-v4 (single-vector) | 61.39% | 81.17% | 78.48% | 58.90% | 49.05% | 60.10% | 57.88% | 37.14% | 15.40% | 75.57% | 72.07% | 89.55% |
| jina-embeddings-v4 (multi-vector) | 70.89% | 97.95% | 73.55% | 60.91% | 65.65% | 80.58% | 73.14% | 54.15% | 21.94% | 82.34% | 81.19% | 88.39% |

MTEB Integration

dataset: Add JinaVDR by maximilianwerk · Pull Request #2942 · embeddings-benchmark/mteb

Since MTEB has become the de facto standard for retrieval benchmarking, we're integrating JinaVDR directly into the MTEB framework to maximize adoption and ease of use. This makes it easier for researchers to run visual retrieval models on our benchmark using familiar evaluation infrastructure. However, migrating our data to the BEIR format did require some trade-offs, like not including OCR results in the MTEB version. This means that traditional text-based methods like BM25 aren't directly runnable within MTEB, which reinforces the focus on visual document understanding rather than falling back on text-based retrieval.
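Once the tasks are merged, running them should look roughly like any other MTEB evaluation. The sketch below is hedged: the task name is hypothetical, the exact identifiers will be listed in the MTEB registry, and the model must be one that MTEB can load for image-document retrieval:

```python
# Hedged sketch of running a JinaVDR task via MTEB after the PR above is merged.
# "JinaVDRAirBnBRetrieval" is a hypothetical task name; check the MTEB registry for
# the real identifiers. The MTEB version ships only page images (no OCR text).
import mteb

model = mteb.get_model("jinaai/jina-embeddings-v4")        # any MTEB-compatible visual retrieval model
tasks = mteb.get_tasks(tasks=["JinaVDRAirBnBRetrieval"])   # hypothetical task name
results = mteb.MTEB(tasks=tasks).run(model, output_folder="results/jina-vdr")
print(results)
```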

Limitations

To build a comprehensive benchmark from a wide range of sources, we had to perform careful preprocessing to ensure both practical usability and evaluation quality. We applied size normalization by subsampling each dataset to a maximum of 1,000 examples (down from thousands or tens of thousands), keeping the benchmark practical to run while maintaining good coverage across tasks. This constraint was especially important given the compute required to process high-resolution visual documents.

We used quality filtering to address several challenges common in real-world document collections. While poor image quality in scanned documents often reflects realistic use cases, it made it tougher to control the quality of synthetic data. We implemented consistency filtering to remove duplicates (which are common across large document collections) and used LLMs to filter out low-quality queries that wouldn't provide useful evaluation signals, such as overly generic questions like "What can you see in the chart?". For synthetic data generation, we ran into limitations in query diversity despite using varied prompting strategies, and needed to perform manual curation to ensure enough evaluation coverage across different retrieval scenarios.

Conclusion

Visual document retrieval evaluation has reached a point where traditional text-based benchmarks no longer capture the complexity of how humans actually search for and consume information. JinaVDR addresses this gap by providing comprehensive evaluation across a range of tasks and languages far exceeding previous benchmarks.

Moving forward, the industry needs benchmarks that reflect genuine retrieval challenges rather than artificial constraints. As organizations come to rely more on visual document retrieval for tasks ranging from legal research to medical diagnostics, evaluation frameworks have to evolve beyond narrow academic datasets toward the messy, multilingual, and visually complex documents that we find in the real world. JinaVDR is just the first step in building retrieval systems that truly understand how visual and textual information work together in practice.
