Traditional computer vision models typically focus on mimicking human visual perception. jina-embeddings-v4 takes a different approach: it combines image and text processing to understand how people read and interpret information presented visually. Unlike OCR programs that simply digitize text, it actually parses complex visual materials like infographics, charts, diagrams, and tables—documents where both the text and visual elements carry semantic meaning. We call these "visually rich documents."

If we just used OCR and text embedding models, we would miss important information. If we used conventional embedding models trained on image-caption pairs, we wouldn't be able to extract the semantics of the text. And for things like tables, we need to understand both what the text means and the spatial relationships between the text elements. We have to process these documents in visually aware ways, which is why the task is called visual document retrieval.

But you have to be able to see things to make sense of them, and it’s no different for embedding models. Image quality counts.

Image quality differences can come from many sources: photos are out of focus, lighting conditions are bad, movement causes blurring, and lossy compression algorithms destroy details. But visually rich documents are typically “born” digital. They consist of things like screenshots, presentation slides, and glossy PDFs — data that has been rendered into images by some presentation or publication process. Sometimes, they’re scans of actual paper pages, but if competently made, those images don’t suffer from the kinds of problems that are pervasive with pictures of real-world scenes.

For visually rich documents, the primary input quality problem is image resolution. Too small, and the information we need for retrieval is lost. But too big, and we flood the model with spurious details that undermine accurate processing. Right-sizing your inputs to jina-embeddings-v4 will save you money and improve your retrieval results.
This article looks at how jina-embeddings-v4 processes images and the effect of image resolution on its ability to retrieve visually rich documents. We’ve provided code in the JinaVDR repo on Hugging Face to reproduce our results or test your own data if you want to apply ideas from this article yourself.
## Visual Language Models
jina-embeddings-v4 is a visual language model (VLM) that extends the Qwen2.5-VL-3B-Instruct VLM. This approach to embedding images is different from single-modality image or text models and CLIP-style models like jina-clip-v2.
Figure 5 is a schematic of the jina-embeddings-v4 model architecture. The backbone is similar to a conventional transformer-based embedding model, except that it supports dual output modes: a single-vector (dense) embedding produced by mean-pooling the final layer of the decoder, and a multi-vector (late interaction) output produced by a projection layer, which yields one vector per input token.
For text, the input is the same as in conventional text embedding models: the text is tokenized, and each token is replaced with a vector from a lookup table. This sequence of token vectors serves as input to the model.
The innovation of VLMs is in their handling of images: a conventional image embedding model is connected to the text model, but instead of producing an image embedding by mean pooling, its final layer is fed to the text model as if it were just a sequence of token vectors.
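To make the data flow concrete, here is a heavily simplified, runnable sketch in PyTorch. It is not the actual jina-embeddings-v4 code: the layer sizes, the stand-in vision encoder, and the toy inputs are all invented for illustration. What it shows is the key architectural point: image patches enter the model as if they were text tokens, and both output modes come from the same hidden states.

```python
# Conceptual sketch only, NOT the real jina-embeddings-v4 implementation.
import torch
import torch.nn as nn

hidden_dim, multi_vector_dim = 1024, 128

# Stand-ins for the real components (dimensions are made up).
vision_encoder = nn.Linear(28 * 28 * 3, hidden_dim)   # one 28x28 RGB patch -> one "token" vector
backbone_layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(backbone_layer, num_layers=2)
projection = nn.Linear(hidden_dim, multi_vector_dim)   # late-interaction head

# A toy input: 100 flattened image patches and 10 text token vectors.
patch_pixels = torch.rand(1, 100, 28 * 28 * 3)
patch_vectors = vision_encoder(patch_pixels)           # image patches become "token" vectors
text_vectors = torch.rand(1, 10, hidden_dim)           # normally looked up from an embedding table

# The backbone sees a single sequence; it doesn't care which vectors came from pixels.
hidden_states = backbone(torch.cat([patch_vectors, text_vectors], dim=1))

dense_embedding = hidden_states.mean(dim=1)            # single-vector output: mean pooling
multi_vectors = projection(hidden_states)              # multi-vector output: one vector per input token
```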

This means that large images have the same problems for embedding models that long texts have. Longer inputs increase computational costs and encoding times and often produce uninformative embeddings. Small images have the opposite problem: If too much information is packed into one patch, it gets attenuated in the embedding process and gets lost in single-vector embeddings, while for multi-vector embeddings, it reduces the strength of matches. There is a trade-off between the amount of information provided to the model and its ability to accurately extract relevant information.
But texts can’t be resized; images can.
## Image Preprocessing in jina-embeddings-v4
When you submit an image to jina-embeddings-v4, it’s split into patches of 28x28 pixels. This means that the number of patches is roughly proportional to the size of the image. If you use the Jina API, the maximum size supported is 602,112 pixels, but it’s a user-settable parameter in the model code if you download it.
The VLM architecture means that the number of 28x28 image patches the model can support is equal to the number of tokens of text it can support. That places an absolute upper bound on image size at roughly 20 megapixels.
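As a rough rule of thumb, a budget of max_pixels translates into at most max_pixels / (28x28) patches, so the API maximum of 602,112 pixels corresponds to 768 image "tokens", matching the token counts in the tables below. The helper below estimates that cost; it assumes the preprocessor downscales oversized images while preserving aspect ratio, which glosses over details like rounding dimensions to patch multiples.

```python
# Back-of-the-envelope patch counter. Assumption: oversized images are downscaled
# to fit max_pixels with their aspect ratio preserved; the real preprocessor also
# snaps dimensions to multiples of the 28-pixel patch size.
from PIL import Image

PATCH_SIZE = 28

def estimated_patch_count(image_path: str, max_pixels: int = 602_112) -> int:
    width, height = Image.open(image_path).size
    if width * height > max_pixels:
        scale = (max_pixels / (width * height)) ** 0.5
        width, height = int(width * scale), int(height * scale)
    return max(1, width // PATCH_SIZE) * max(1, height // PATCH_SIZE)

# 602,112 / (28 * 28) = 768, the maximum number of image "tokens" at the API limit.
```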

## Experiments with Image Resolution
We evaluated different image resolutions using jina-embeddings-v4 with the ViDoRe v1 and v2 suites and part of the JinaVDR benchmark suite, looking at both single-vector and multi-vector outputs. Adjusting the max_pixels parameter of the model to a range of values up to 19,267,584 pixels (i.e., 5,376x3,584) produces embeddings from different image resolutions. Where an image is already smaller than the value of max_pixels, it is not changed.
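If you want to reproduce this kind of sweep yourself, one simple approach is to control the resolution outside the model by resizing images before embedding them. The sketch below assumes the encode_image method and task argument shown in the model's Hugging Face usage examples; check the model card for the exact signature in your version, or set the model's own max_pixels parameter directly if it is available.

```python
# Hedged sketch: emulate different max_pixels settings by resizing images up front.
# `encode_image(images=..., task="retrieval")` is assumed from the model's published
# usage examples and may differ between releases.
from PIL import Image
from transformers import AutoModel

def resize_to_budget(image: Image.Image, max_pixels: int) -> Image.Image:
    """Downscale so the pixel count is at most max_pixels, keeping the aspect ratio."""
    width, height = image.size
    if width * height <= max_pixels:
        return image  # smaller images are left unchanged
    scale = (max_pixels / (width * height)) ** 0.5
    return image.resize((int(width * scale), int(height * scale)), Image.LANCZOS)

model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4", trust_remote_code=True)

budgets = [301_056, 602_112, 1_204_224, 2_408_448, 4_816_896, 9_633_792]
page = Image.open("scanned_page.png")  # hypothetical document image
for max_pixels in budgets:
    embedding = model.encode_image(images=[resize_to_budget(page, max_pixels)],
                                   task="retrieval")
```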
### ViDoRe v1
Table 1 reports single-vector (dense vector) embedding retrieval results for six resolutions on individual ViDoRe v1 benchmarks (average nDCG@5 score). We only report values up to 9,633,792 pixels because no images in ViDoRe v1 were larger than that.

| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| arxivqa_test_subsampled | 0.83487 | 0.84529 | 0.84537 | 0.83785 | 0.83439 | 0.83469 |
| docvqa_test_subsampled | 0.47366 | 0.50715 | 0.52421 | 0.51075 | 0.50287 | 0.50258 |
| infovqa_test_subsampled | 0.84404 | 0.87510 | 0.87890 | 0.87978 | 0.87672 | 0.87710 |
| shiftproject_test | 0.77524 | 0.81494 | 0.83988 | 0.84427 | 0.84127 | 0.84196 |
| syntheticDocQA_artificial_intelligence_test | 0.93809 | 0.96786 | 0.96655 | 0.97155 | 0.97024 | 0.97024 |
| syntheticDocQA_energy_test | 0.86865 | 0.89540 | 0.89847 | 0.91172 | 0.91286 | 0.91286 |
| syntheticDocQA_government_reports_test | 0.91708 | 0.93417 | 0.93865 | 0.92309 | 0.91609 | 0.91609 |
| syntheticDocQA_healthcare_industry_test | 0.93865 | 0.96428 | 0.96024 | 0.96542 | 0.95417 | 0.95286 |
| tabfquad_test_subsampled | 0.94298 | 0.94853 | 0.94502 | 0.94505 | 0.94505 | 0.94612 |
| tatdqa_test | 0.58622 | 0.64832 | 0.65867 | 0.65985 | 0.65395 | 0.65498 |
| AVERAGE | 0.81195 | 0.84010 | 0.84560 | 0.84493 | 0.84076 | 0.84095 |
You can see in Table 1 that the highest resolution is far from the best. At 9.6 megapixels, the model scored below the best resolution on every test except the one with no images larger than 4.8 megapixels, where the score is identical because no resizing took place. The best performer on average was 1.2 megapixels, and the resolution with the most best scores across individual benchmarks was 2.4 megapixels.
If we took the best score on each benchmark and averaged them, we would get 0.84905, only 0.00345 (about 0.4%) higher than the average score for 1.2 megapixels.
Using multi-vector embeddings, as shown in Table 2, our retrieval scores are significantly higher. Multi-vector matching typically performs better than single-vector matching. However, the pattern across resolutions is roughly the same.
| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| arxivqa_test_subsampled | 0.87456 | 0.88881 | 0.88736 | 0.88531 | 0.88899 | 0.89052 |
| docvqa_test_subsampled | 0.55344 | 0.61284 | 0.61123 | 0.59941 | 0.59087 | 0.59229 |
| infovqa_test_subsampled | 0.88777 | 0.92646 | 0.93376 | 0.94007 | 0.93459 | 0.93533 |
| shiftproject_test | 0.86224 | 0.90563 | 0.93547 | 0.92847 | 0.92240 | 0.92240 |
| syntheticDocQA_artificial_intelligence_test | 0.99631 | 0.99131 | 0.99500 | 0.99262 | 0.99262 | 0.99262 |
| syntheticDocQA_energy_test | 0.95216 | 0.96524 | 0.96524 | 0.96893 | 0.96762 | 0.96762 |
| syntheticDocQA_government_reports_test | 0.95934 | 0.97085 | 0.97524 | 0.98024 | 0.96655 | 0.96655 |
| syntheticDocQA_healthcare_industry_test | 0.97893 | 0.97893 | 0.99631 | 0.98524 | 0.98393 | 0.98393 |
| tabfquad_test_subsampled | 0.95386 | 0.95732 | 0.95611 | 0.95379 | 0.95379 | 0.95379 |
| tatdqa_test | 0.70547 | 0.78534 | 0.79516 | 0.80422 | 0.80552 | 0.80727 |
| AVERAGE | 0.87241 | 0.89827 | 0.90509 | 0.90383 | 0.90069 | 0.90123 |
Just like for single-vector embeddings, the 1.2 megapixel resolution has the highest average score, and 2.4 megapixels is the best-scoring resolution for the largest number of individual datasets. The average of best scores over all benchmarks is 0.90853, and the 1.2 megapixel resolution's average score is only 0.00344 (about 0.4%) lower, almost exactly the same gap as in the single-vector case.
### ViDoRe v2
ViDoRe v2 uses much more detailed and colorful imagery than ViDoRe v1, which suggests that the optimal resolution will likely be higher.

We tested multi-vector embeddings from jina-embeddings-v4 on the ViDoRe v2 benchmark suite, with results in Table 3. Single-vector results are similar but with lower scores, so we omit them here.
| Benchmark Dataset | 150,528 px | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 192 | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| esg_reports_v2 | 0.40444 | 0.54013 | 0.52005 | 0.51916 | 0.49953 | 0.52664 | 0.51442 |
| biomedical_lectures_v2 | 0.58760 | 0.60479 | 0.61840 | 0.60748 | 0.60748 | 0.60748 | 0.60748 |
| economics_reports_v2 | 0.47666 | 0.50399 | 0.54216 | 0.54998 | 0.54998 | 0.54998 | 0.54998 |
| esg_reports_human_labeled_v2 | 0.42171 | 0.56940 | 0.61227 | 0.57307 | 0.61108 | 0.63858 | 0.64921 |
| AVERAGE | 0.47260 | 0.55458 | 0.57322 | 0.56242 | 0.56702 | 0.58067 | 0.58027 |
In this case, the best performer on average is the 4.8 megapixel resolution, with average performance at 9.6 megapixels essentially identical. Nonetheless, on three of the four benchmarks, lower resolutions match or beat the highest resolution; only esg_reports_human_labeled_v2 gets its best score at 9.6 megapixels.
### High-Resolution Benchmarks
ViDoRe v1 and ViDoRe v2 consist of images with similar native resolutions, so we took two benchmarks from the JinaVDR suite that contain very high-resolution, difficult-to-process images, and performed the same multi-vector embeddings test with them.
One is the europeana-de-news benchmark, which contains high-resolution scans of German newspapers from the 17th to 20th centuries; the other is the wikimedia-commons-maps benchmark, which contains very-high-resolution scans of printed maps, mostly from the pre-digital era.

Figure: (Left) A high-resolution map scan from wikimedia-commons-maps. (Right) The front page of the Hamburger Nachrichten newspaper, dated 26 April 1819 and included in europeana-de-news, scanned at 4324x4738 pixels.

The results are very mixed, as you can see in Table 4.
| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px | 19,267,584 px |
|---|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 | 24,576 |
| europeana-de-news | 0.46319 | 0.59457 | 0.66802 | 0.66550 | 0.66407 | 0.63948 | 0.65208 |
| wikimedia-commons-maps | 0.23671 | 0.34421 | 0.42835 | 0.52268 | 0.53464 | 0.53464 | 0.53588 |
| AVERAGE | 0.34995 | 0.46939 | 0.54819 | 0.59409 | 0.59936 | 0.58706 | 0.59398 |
The newspaper data clearly does not require the same resolution as the map data. This is likely because the text in the maps is very small compared to the overall image size and it really needs to be readable for retrieval to work. When the newspaper scans are sized larger than optimal, performance fluctuates and declines.
No single resolution is optimal for both benchmarks: the resolution with the best average score is suboptimal for each dataset individually. That discovery is what motivates what we did next.
## Multi-Resolution Embeddings
The VLM architecture of jina-embeddings-v4 treats individual image patches the same way it treats text tokens, so it's easy for us to augment high-resolution patches with low-resolution ones. It just means more input data, as long as we stay within the maximum the model supports. If lower resolutions capture semantics that higher resolutions miss, we can simply include both and not worry about what the optimal resolution is.
To test this hypothesis, we looked at three combinations of resolutions:
| | Mix 1 | Mix 2 | Mix 3 |
|---|---|---|---|
| Maximum number of total tokens | 2,880 | 5,234 | 12,096 |
| Resolutions | 150,528 px<br>301,056 px<br>602,112 px<br>1,204,224 px | 50,000 px<br>90,000 px<br>160,000 px<br>250,000 px<br>360,000 px<br>490,000 px<br>602,112 px<br>900,000 px<br>1,204,224 px | 150,528 px<br>301,056 px<br>602,112 px<br>1,204,224 px<br>2,408,448 px<br>4,816,896 px |
This means that when we tested Mix 1, we resized each image to four different resolutions (150,528 px, 301,056 px, 602,112 px, and 1,204,224 px), processed each of them individually into multi-vector embeddings, and then concatenated the results before doing query matching. The same goes for Mix 2 and Mix 3, but at different resolutions; a code sketch of this procedure appears at the end of this section. We tried all three combinations on the ViDoRe v2 benchmarks, summarized in Table 6.
| Benchmark Dataset | Best Single Resolution | Mix 1 | Mix 2 | Mix 3 |
|---|---|---|---|---|
| Max embedding vector size in tokens | — | 2,880 | 5,234 | 12,096 |
| esg_reports_v2 | 0.54013 | 0.58354 | 0.59252 | 0.56567 |
| biomedical_lectures_v2 | 0.61840 | 0.61678 | 0.61714 | 0.61638 |
| economics_reports_v2 | 0.54998 | 0.54997 | 0.55534 | 0.55049 |
| esg_reports_human_labeled_v2 | 0.64921 | 0.67726 | 0.68057 | 0.66734 |
| AVERAGE | 0.58943 | 0.60689 | 0.61139 | 0.59997 |
For all ViDoRe v2 benchmarks, Mix 2 outperforms Mixes 1 and 3, suggesting that a greater variety of resolutions produces better results than simply adding higher resolutions. On two of the four benchmarks, all mixtures of resolutions outperform the best single resolution. On the remaining two, the shortfalls are marginal: on economics_reports_v2 only Mix 1 falls short, and then by just 0.00001, while on biomedical_lectures_v2 the mixes trail the best single resolution by about 0.002 at most.
These results show that although there is no single best mix of resolutions for all data, identifying a good mix of resolutions is the right direction to go to find an optimal solution.
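To make the procedure concrete, here is a hedged sketch of the multi-resolution approach: embed the same page at each resolution in a mix, concatenate the multi-vector outputs, and score queries with standard late-interaction (MaxSim) matching. It reuses the resize_to_budget helper from the earlier sketch, and the encode_image arguments (task, return_multivector) are assumptions based on the model's published usage, so verify them against the model card.

```python
# Sketch of multi-resolution embeddings with late-interaction scoring.
# Assumes `resize_to_budget` from the earlier sketch and a `return_multivector`
# flag on encode_image/encode_text; both are assumptions, not guaranteed API.
import numpy as np

MIX_1 = [150_528, 301_056, 602_112, 1_204_224]

def multi_resolution_vectors(model, image, budgets=MIX_1) -> np.ndarray:
    chunks = []
    for max_pixels in budgets:
        vectors = model.encode_image(images=[resize_to_budget(image, max_pixels)],
                                     task="retrieval",
                                     return_multivector=True)[0]
        chunks.append(np.asarray(vectors))
    # One matrix of patch vectors covering every resolution in the mix.
    return np.concatenate(chunks, axis=0)

def maxsim_score(query_vectors: np.ndarray, doc_vectors: np.ndarray) -> float:
    """Late interaction: each query vector is matched to its best document vector."""
    sims = query_vectors @ doc_vectors.T  # assumes L2-normalized vectors
    return float(sims.max(axis=1).sum())
```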
## Conclusion
Image resolution matters quite a lot for jina-embeddings-v4 when processing visually rich materials. One key issue is that text should be sized to be readable. What you can’t read, AI can’t read either, and if the text matters, then it has to be read.
But too high a resolution makes it difficult for the embedding model to assemble image patches into a coherent whole. It's also expensive: higher resolutions mean more processing and, for multi-vector embeddings, more storage and slower matching.
Using multiple resolutions and applying late interaction-style scoring to all the outputs together is a good way to handle visually rich images of varying sizes. But this adds to processing and storage costs, and it slows down retrieval in the same way that very large resolutions do.
We're looking into ways to operationalize this insight to improve neural search. We are constantly diversifying our training and test data to probe our models for shortcomings and devise ways to improve them, and we are studying how resolution and multi-resolution techniques like the ones described here affect various types of materials. We are also running experiments to see whether these multi-resolution techniques mitigate the effect of noise in images and produce more robust retrieval.
Furthermore, it may be possible to automatically determine in advance the optimal resolution for each image. If we can reliably detect the best resolution, it would remove one more parameter for users while improving overall results, making embeddings more accessible and usable.