Traditional computer vision models typically focus on mimicking human visual perception. jina-embeddings-v4 takes a different approach: it combines image and text processing to understand how people read and interpret information presented visually. Unlike OCR programs that simply digitize text, it actually parses complex visual materials like infographics, charts, diagrams, and tables—documents where both the text and visual elements carry semantic meaning. We call these "visually rich documents."

If we just used OCR and text embedding models, we would miss important information. If we used conventional embedding models trained on image-caption pairs, we wouldn't be able to extract the semantics of the text. And for things like tables, we need to understand both what the text means and the spatial relationships between the text elements. We have to process these documents in visually aware ways, which is why the task is called visual document retrieval.

But you have to be able to see things to make sense of them, and it’s no different for embedding models. Image quality counts.

Image quality differences can come from many sources: photos are out of focus, lighting conditions are bad, movement causes blurring, and lossy compression algorithms destroy details. But visually rich documents are typically “born” digital. They consist of things like screenshots, presentation slides, and glossy PDFs — data that has been rendered into images by some presentation or publication process. Sometimes, they’re scans of actual paper pages, but if competently made, those images don’t suffer from the kinds of problems that are pervasive with pictures of real-world scenes.

For visually rich documents, the primary input quality problem is image resolution. Too small, and the information we need for retrieval is lost. But too big, and we flood the model with spurious details that undermine accurate processing. Right-sizing your inputs to jina-embeddings-v4 will save you money and improve your retrieval results.
This article looks at how jina-embeddings-v4 processes images and the effect of image resolution on its ability to retrieve visually rich documents. We’ve provided code in the JinaVDR repo on Hugging Face to reproduce our results or test your own data if you want to apply ideas from this article yourself.
## Visual Language Models
jina-embeddings-v4 is a visual language model (VLM) that extends the Qwen2.5-VL-3B-Instruct VLM. This approach to embedding images is different from single-modality image or text models and CLIP-style models like jina-clip-v2.
Figure 5 is a schematic of the jina-embeddings-v4 model architecture. The backbone is similar to a conventional transformer-based embedding model, except that it supports dual output modes: a single-vector (dense) embedding produced by mean-pooling the final layer of the decoder, and a multi-vector (late interaction) output produced by a projection layer, which yields one vector per input token.
For text, the input is the same as in conventional text embedding models: the text is tokenized, and each token is replaced with a vector from a lookup table. This sequence of token vectors serves as input to the model.
The innovation of VLMs is in their handling of images: a conventional image embedding model is connected to the text model, but instead of producing an image embedding by mean pooling, its final layer is fed to the text model as if it were just a sequence of token vectors.
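To make the data flow concrete, here is a heavily simplified, runnable sketch in PyTorch. It is not the actual jina-embeddings-v4 code: the layer sizes, the stand-in vision encoder, and the toy inputs are all invented for illustration. What it shows is the key architectural point: image patches enter the model as if they were text tokens, and both output modes come from the same hidden states.

```python
# Conceptual sketch only, NOT the real jina-embeddings-v4 implementation.
import torch
import torch.nn as nn

hidden_dim, multi_vector_dim = 1024, 128

# Stand-ins for the real components (dimensions are made up).
vision_encoder = nn.Linear(28 * 28 * 3, hidden_dim)   # one 28x28 RGB patch -> one "token" vector
backbone_layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(backbone_layer, num_layers=2)
projection = nn.Linear(hidden_dim, multi_vector_dim)   # late-interaction head

# A toy input: 100 flattened image patches and 10 text token vectors.
patch_pixels = torch.rand(1, 100, 28 * 28 * 3)
patch_vectors = vision_encoder(patch_pixels)           # image patches become "token" vectors
text_vectors = torch.rand(1, 10, hidden_dim)           # normally looked up from an embedding table

# The backbone sees a single sequence; it doesn't care which vectors came from pixels.
hidden_states = backbone(torch.cat([patch_vectors, text_vectors], dim=1))

dense_embedding = hidden_states.mean(dim=1)            # single-vector output: mean pooling
multi_vectors = projection(hidden_states)              # multi-vector output: one vector per input token
```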

This means that large images have the same problems for embedding models that long texts have. Longer inputs increase computational costs and encoding times and often produce uninformative embeddings. Small images have the opposite problem: If too much information is packed into one patch, it gets attenuated in the embedding process and gets lost in single-vector embeddings, while for multi-vector embeddings, it reduces the strength of matches. There is a trade-off between the amount of information provided to the model and its ability to accurately extract relevant information.
But texts can’t be resized; images can.
## Image Preprocessing in jina-embeddings-v4
When you submit an image to jina-embeddings-v4, it’s split into patches of 28x28 pixels. This means that the number of patches is roughly proportional to the size of the image. If you use the Jina API, the maximum size supported is 602,112 pixels, but it’s a user-settable parameter in the model code if you download it.
The VLM architecture means that the number of 28x28 image patches the model can support is equal to the number of tokens of text it can support. That places an absolute upper bound on image size at roughly 20 megapixels.
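As a rough rule of thumb, a budget of max_pixels translates into at most max_pixels / (28x28) patches, so the API maximum of 602,112 pixels corresponds to 768 image "tokens", matching the token counts in the tables below. The helper below estimates that cost; it assumes the preprocessor downscales oversized images while preserving aspect ratio, which glosses over details like rounding dimensions to patch multiples.

```python
# Back-of-the-envelope patch counter. Assumption: oversized images are downscaled
# to fit max_pixels with their aspect ratio preserved; the real preprocessor also
# snaps dimensions to multiples of the 28-pixel patch size.
from PIL import Image

PATCH_SIZE = 28

def estimated_patch_count(image_path: str, max_pixels: int = 602_112) -> int:
    width, height = Image.open(image_path).size
    if width * height > max_pixels:
        scale = (max_pixels / (width * height)) ** 0.5
        width, height = int(width * scale), int(height * scale)
    return max(1, width // PATCH_SIZE) * max(1, height // PATCH_SIZE)

# 602,112 / (28 * 28) = 768, the maximum number of image "tokens" at the API limit.
```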

## Experiments with Image Resolution
We evaluated different image resolutions using jina-embeddings-v4 with the ViDoRe v1 and v2 suites and part of the JinaVDR benchmark suite, looking at both single-vector and multi-vector outputs. Adjusting the max_pixels parameter of the model to a range of values up to 19,267,584 pixels (i.e., 5,376x3,584) produces embeddings from different image resolutions. Where an image is already smaller than the value of max_pixels, it is not changed.
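If you want to reproduce this kind of sweep yourself, one simple approach is to control the resolution outside the model by resizing images before embedding them. The sketch below assumes the encode_image method and task argument shown in the model's Hugging Face usage examples; check the model card for the exact signature in your version, or set the model's own max_pixels parameter directly if it is available.

```python
# Hedged sketch: emulate different max_pixels settings by resizing images up front.
# `encode_image(images=..., task="retrieval")` is assumed from the model's published
# usage examples and may differ between releases.
from PIL import Image
from transformers import AutoModel

def resize_to_budget(image: Image.Image, max_pixels: int) -> Image.Image:
    """Downscale so the pixel count is at most max_pixels, keeping the aspect ratio."""
    width, height = image.size
    if width * height <= max_pixels:
        return image  # smaller images are left unchanged
    scale = (max_pixels / (width * height)) ** 0.5
    return image.resize((int(width * scale), int(height * scale)), Image.LANCZOS)

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4", trust_remote_code=True)

budgets = [301_056, 602_112, 1_204_224, 2_408_448, 4_816_896, 9_633_792]
page = Image.open("scanned_page.png")  # hypothetical document image
for max_pixels in budgets:
    embedding = model.encode_image(images=[resize_to_budget(page, max_pixels)],
                                   task="retrieval")
```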
### ViDoRe v1
Table 1 reports single-vector (dense vector) embedding retrieval results for six resolutions on individual ViDoRe v1 benchmarks (average nDCG@5 score). We only report values up to 9,633,792 pixels because no images in ViDoRe v1 were larger than that.

| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| arxivqa_test_subsampled | 0.83487 | 0.84529 | 0.84537 | 0.83785 | 0.83439 | 0.83469 |
| docvqa_test_subsampled | 0.47366 | 0.50715 | 0.52421 | 0.51075 | 0.50287 | 0.50258 |
| infovqa_test_subsampled | 0.84404 | 0.87510 | 0.87890 | 0.87978 | 0.87672 | 0.87710 |
| shiftproject_test | 0.77524 | 0.81494 | 0.83988 | 0.84427 | 0.84127 | 0.84196 |
| syntheticDocQA_artificial_intelligence_test | 0.93809 | 0.96786 | 0.96655 | 0.97155 | 0.97024 | 0.97024 |
| syntheticDocQA_energy_test | 0.86865 | 0.89540 | 0.89847 | 0.91172 | 0.91286 | 0.91286 |
| syntheticDocQA_government_reports_test | 0.91708 | 0.93417 | 0.93865 | 0.92309 | 0.91609 | 0.91609 |
| syntheticDocQA_healthcare_industry_test | 0.93865 | 0.96428 | 0.96024 | 0.96542 | 0.95417 | 0.95286 |
| tabfquad_test_subsampled | 0.94298 | 0.94853 | 0.94502 | 0.94505 | 0.94505 | 0.94612 |
| tatdqa_test | 0.58622 | 0.64832 | 0.65867 | 0.65985 | 0.65395 | 0.65498 |
| AVERAGE | 0.81195 | 0.84010 | 0.84560 | 0.84493 | 0.84076 | 0.84095 |
You can see in Table 1 that the highest resolution is far from the best. At 9.6 megapixels, the model scored below the best resolution on every test except the one with no images larger than 4.8 megapixels, where the score is identical because no resizing took place. The best performer on average was 1.2 megapixels, and the resolution with the most best scores across individual benchmarks was 2.4 megapixels.
If we took the best score on each benchmark and averaged them, we would get 0.84905, only 0.00345 (about 0.4%) higher than the average score for 1.2 megapixels.
Using multi-vector embeddings, as shown in Table 2, our retrieval scores are significantly higher. Multi-vector matching typically performs better than single-vector matching. However, the pattern across resolutions is roughly the same.
| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| arxivqa_test_subsampled | 0.87456 | 0.88881 | 0.88736 | 0.88531 | 0.88899 | 0.89052 |
| docvqa_test_subsampled | 0.55344 | 0.61284 | 0.61123 | 0.59941 | 0.59087 | 0.59229 |
| infovqa_test_subsampled | 0.88777 | 0.92646 | 0.93376 | 0.94007 | 0.93459 | 0.93533 |
| shiftproject_test | 0.86224 | 0.90563 | 0.93547 | 0.92847 | 0.92240 | 0.92240 |
| syntheticDocQA_artificial_intelligence_test | 0.99631 | 0.99131 | 0.99500 | 0.99262 | 0.99262 | 0.99262 |
| syntheticDocQA_energy_test | 0.95216 | 0.96524 | 0.96524 | 0.96893 | 0.96762 | 0.96762 |
| syntheticDocQA_government_reports_test | 0.95934 | 0.97085 | 0.97524 | 0.98024 | 0.96655 | 0.96655 |
| syntheticDocQA_healthcare_industry_test | 0.97893 | 0.97893 | 0.99631 | 0.98524 | 0.98393 | 0.98393 |
| tabfquad_test_subsampled | 0.95386 | 0.95732 | 0.95611 | 0.95379 | 0.95379 | 0.95379 |
| tatdqa_test | 0.70547 | 0.78534 | 0.79516 | 0.80422 | 0.80552 | 0.80727 |
| AVERAGE | 0.87241 | 0.89827 | 0.90509 | 0.90383 | 0.90069 | 0.90123 |
Just like for single-vector embeddings, the 1.2 megapixel resolution has the highest average score, and 2.4 megapixels is the best-scoring resolution for the largest number of individual datasets. The average of best scores over all benchmarks is 0.90853, and the 1.2 megapixel resolution's average score is only 0.00344 (about 0.4%) lower, almost exactly the same gap as in the single-vector case.
### ViDoRe v2
ViDoRe v2 uses much more detailed and colorful imagery than ViDoRe v1, which suggests that the optimal resolution will likely be higher.

We tested multi-vector embeddings from jina-embeddings-v4 on the ViDoRe v2 benchmark suite, with results in Table 3. Single-vector results are similar but with lower scores, so we omit them here.
| Benchmark Dataset | 150,528 px | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 192 | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| esg_reports_v2 | 0.40444 | 0.54013 | 0.52005 | 0.51916 | 0.49953 | 0.52664 | 0.51442 |
| biomedical_lectures_v2 | 0.58760 | 0.60479 | 0.61840 | 0.60748 | 0.60748 | 0.60748 | 0.60748 |
| economics_reports_v2 | 0.47666 | 0.50399 | 0.54216 | 0.54998 | 0.54998 | 0.54998 | 0.54998 |
| esg_reports_human_labeled_v2 | 0.42171 | 0.56940 | 0.61227 | 0.57307 | 0.61108 | 0.63858 | 0.64921 |
| AVERAGE | 0.47260 | 0.55458 | 0.57322 | 0.56242 | 0.56702 | 0.58067 | 0.58027 |
In this case, the best performer on average is the 4.8 megapixel resolution, with average performance at 9.6 megapixels essentially identical. Nonetheless, on three of the four benchmarks, lower resolutions match or beat the highest resolution; only esg_reports_human_labeled_v2 gets its best score at 9.6 megapixels.
### High-Resolution Benchmarks
ViDoRe v1 and ViDoRe v2 consist of images with similar native resolutions, so we took two benchmarks from the JinaVDR suite that contain very high-resolution, difficult-to-process images, and performed the same multi-vector embeddings test with them.
One is the europeana-de-news benchmark, which contains high-resolution scans of German newspapers from the 17th to 20th centuries; the other is the wikimedia-commons-maps benchmark, which contains very-high-resolution scans of printed maps, mostly from the pre-digital era.

Figure: (Left) A high-resolution map scan from wikimedia-commons-maps. (Right) The front page of the Hamburger Nachrichten newspaper, dated 26 April 1819 and included in europeana-de-news, scanned at 4324x4738 pixels.

The results are very mixed, as you can see in Table 4.
| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px | 19,267,584 px |
|---|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 | 24,576 |
| europeana-de-news | 0.46319 | 0.59457 | 0.66802 | 0.66550 | 0.66407 | 0.63948 | 0.65208 |
| wikimedia-commons-maps | 0.23671 | 0.34421 | 0.42835 | 0.52268 | 0.53464 | 0.53464 | 0.53588 |
| AVERAGE | 0.34995 | 0.46939 | 0.54819 | 0.59409 | 0.59936 | 0.58706 | 0.59398 |
The newspaper data clearly does not require the same resolution as the map data. This is likely because the text in the maps is very small compared to the overall image size and it really needs to be readable for retrieval to work. When the newspaper scans are sized larger than optimal, performance fluctuates and declines.
No single resolution is optimal for both benchmarks: the resolution with the best average score is suboptimal for each dataset individually. That discovery is what motivates what we did next.
## Multi-Resolution Embeddings
The VLM architecture of jina-embeddings-v4 treats individual image patches the same way it treats text tokens, so it's easy for us to augment high-resolution patches with low-resolution ones. It just means more input data, as long as we stay within the maximum the model supports. If lower resolutions capture semantics that higher resolutions miss, we can simply include both and not worry about what the optimal resolution is.
To test this hypothesis, we looked at three combinations of resolutions:
| | Mix 1 | Mix 2 | Mix 3 |
|---|---|---|---|
| Maximum number of total tokens | 2,880 | 5,234 | 12,096 |
| Resolutions | 150,528 px<br>301,056 px<br>602,112 px<br>1,204,224 px | 50,000 px<br>90,000 px<br>160,000 px<br>250,000 px<br>360,000 px<br>490,000 px<br>602,112 px<br>900,000 px<br>1,204,224 px | 150,528 px<br>301,056 px<br>602,112 px<br>1,204,224 px<br>2,408,448 px<br>4,816,896 px |
This means that when we tested Mix 1, we resized each image to four different resolutions (150,528 px, 301,056 px, 602,112 px, and 1,204,224 px), processed each of them individually into multi-vector embeddings, and then concatenated the results before doing query matching. The same goes for Mix 2 and Mix 3, but at different resolutions; a code sketch of this procedure appears at the end of this section. We tried all three combinations on the ViDoRe v2 benchmarks, summarized in Table 6.
| Benchmark Dataset | Best Single Resolution | Mix 1 | Mix 2 | Mix 3 |
|---|---|---|---|---|
| Max embedding vector size in tokens | — | 2,880 | 5,234 | 12,096 |
| esg_reports_v2 | 0.54013 | 0.58354 | 0.59252 | 0.56567 |
| biomedical_lectures_v2 | 0.61840 | 0.61678 | 0.61714 | 0.61638 |
| economics_reports_v2 | 0.54998 | 0.54997 | 0.55534 | 0.55049 |
| esg_reports_human_labeled_v2 | 0.64921 | 0.67726 | 0.68057 | 0.66734 |
| AVERAGE | 0.58943 | 0.60689 | 0.61139 | 0.59997 |
For all ViDoRe v2 benchmarks, Mix 2 outperforms Mixes 1 and 3, suggesting that a greater variety of resolutions produces better results than simply adding higher resolutions. On two of the four benchmarks, all mixtures of resolutions outperform the best single resolution. On the remaining two, the shortfalls are marginal: on economics_reports_v2 only Mix 1 falls short, and then by just 0.00001, while on biomedical_lectures_v2 the mixes trail the best single resolution by about 0.002 at most.
These results show that although there is no single best mix of resolutions for all data, identifying a good mix of resolutions is the right direction to go to find an optimal solution.
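To make the procedure concrete, here is a hedged sketch of the multi-resolution approach: embed the same page at each resolution in a mix, concatenate the multi-vector outputs, and score queries with standard late-interaction (MaxSim) matching. It reuses the resize_to_budget helper from the earlier sketch, and the encode_image arguments (task, return_multivector) are assumptions based on the model's published usage, so verify them against the model card.

```python
# Sketch of multi-resolution embeddings with late-interaction scoring.
# Assumes `resize_to_budget` from the earlier sketch and a `return_multivector`
# flag on encode_image/encode_text; both are assumptions, not guaranteed API.
import numpy as np

MIX_1 = [150_528, 301_056, 602_112, 1_204_224]

def multi_resolution_vectors(model, image, budgets=MIX_1) -> np.ndarray:
    chunks = []
    for max_pixels in budgets:
        vectors = model.encode_image(images=[resize_to_budget(image, max_pixels)],
                                     task="retrieval",
                                     return_multivector=True)[0]
        chunks.append(np.asarray(vectors))
    # One matrix of patch vectors covering every resolution in the mix.
    return np.concatenate(chunks, axis=0)

def maxsim_score(query_vectors: np.ndarray, doc_vectors: np.ndarray) -> float:
    """Late interaction: each query vector is matched to its best document vector."""
    sims = query_vectors @ doc_vectors.T  # assumes L2-normalized vectors
    return float(sims.max(axis=1).sum())
```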
## Conclusion
Image resolution matters quite a lot for jina-embeddings-v4 when processing visually rich materials. One key issue is that text should be sized to be readable. What you can’t read, AI can’t read either, and if the text matters, then it has to be read.
But too high a resolution makes it difficult for the embedding model to assemble image patches into a coherent whole. It's also expensive: higher resolutions mean more processing and, for multi-vector embeddings, more storage and slower matching.
Using multiple resolutions and applying late interaction-style scoring to all the outputs together is a good way to handle visually rich images of varying sizes. But this adds to processing and storage costs, and it slows down retrieval in the same way that very large resolutions do.
We're looking into ways to operationalize this insight to improve neural search. We are constantly diversifying our training and test data to probe our models for shortcomings and devise ways to improve them, and we are studying how resolution and multi-resolution techniques like the ones described here affect various types of materials. We are also running experiments to see whether these multi-resolution techniques mitigate the effect of noise in images and produce more robust retrieval.
Furthermore, it may be possible to automatically determine in advance the optimal resolution for each image. If we can reliably detect the best resolution, it would remove one more parameter for users while improving overall results, making embeddings more accessible and usable.