Tech blog
July 31, 2025

How Image Resolution Impacts Visual Document Retrieval

Image resolution is crucial for embedding visually rich documents. Too small and models miss key details; too large and they can't connect the parts.
Maximilian Werk, Michael Günther, Scott Martens • 12 minutes read

Traditional computer vision models typically focus on mimicking human visual perception. jina-embeddings-v4 takes a different approach: it combines image and text processing to understand how people read and interpret information presented visually. Unlike OCR programs that simply digitize text, it actually parses complex visual materials like infographics, charts, diagrams, and tables—documents where both the text and visual elements carry semantic meaning. We call these "visually rich documents."

Figure 1: Examples of visually rich documents from the JinaVDR benchmark collection.

If we just used OCR and text embedding models, we would miss important information. If we used conventional embedding models trained on image-caption pairs, we wouldn't be able to extract the semantics of the text. And in the case of things like tables, we have to understand both what the text means and the spatial relationships between text elements. We have to process these documents in visually aware ways, which is why the task is called visual document retrieval.

Figure 2: A table rendered as an image, from the JinaVDR benchmark collection. Spatial relations between the words are very important to making sense of the table’s meaning.

But you have to be able to see things to make sense of them, and it’s no different for embedding models. Image quality counts.

Figure 3: A frame from the Patterson–Gimlin film. Is it Bigfoot? A blurry bear? A guy in a gorilla suit? Poor image quality makes it hard to say.

Image quality differences can come from many sources: photos are out of focus, lighting conditions are bad, movement causes blurring, and lossy compression algorithms destroy details. But visually rich documents are typically “born” digital. They consist of things like screenshots, presentation slides, and glossy PDFs — data that has been rendered into images by some presentation or publication process. Sometimes, they’re scans of actual paper pages, but if competently made, those images don’t suffer from the kinds of problems that are pervasive with pictures of real-world scenes.

Figure 4: The effect of changing image resolution. The smallest text is lost the fastest and we can’t expect embedding models to make sense of text that’s just a blur. Even the visual content is lost if the resolution is low enough.

For visually rich documents, the primary input quality problem is image resolution. Too small, and the information we need for retrieval is lost. But too big, and we flood the model with spurious details that undermine accurate processing. Right-sizing your inputs to jina-embeddings-v4 will save you money and improve your retrieval results.

This article looks at how jina-embeddings-v4 processes images and the effect of image resolution on its ability to retrieve visually rich documents. We’ve provided code in the JinaVDR repo on Hugging Face to reproduce our results or test your own data if you want to apply ideas from this article yourself.

GitHub: jina-ai/jina-vdr — Jina VDR is a multilingual, multi-domain benchmark for visual document retrieval.

Visual Language Models

jina-embeddings-v4 is a visual language model (VLM) that extends the Qwen2.5-VL-3B-Instruct VLM. This approach to embedding images is different from single-modality image or text models and CLIP-style models like jina-clip-v2.

Figure 5 is a schematic of the jina-embeddings-v4 model architecture. The backbone model is similar to a conventional transformer-based embedding model, except that it supports dual output modes: a single-vector (dense vector) embedding produced by mean-pooling the final layer of the decoder, and a multi-vector (late interaction) output produced by passing the final layer through a projection layer, yielding one output vector per input token.

For more information about jina-embeddings-v4 and its two output modes, see the release notice and technical report.
Figure 5: The architecture of jina-embeddings-v4, from the version 4 Technical Report.
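To make the two output modes concrete, here is a toy sketch (not the model's actual code; the unit normalization is our assumption for illustration) of how a sequence of final-layer vectors becomes either a single dense embedding or a set of late-interaction vectors:

```python
import numpy as np

def pool_outputs(hidden_states: np.ndarray, projection: np.ndarray):
    """Toy illustration of the two output modes.
    hidden_states: (n_tokens, d) final-layer vectors from the backbone.
    projection:    (d, d_out) projection matrix for the multi-vector head."""
    # Single-vector mode: mean-pool all token vectors into one dense embedding.
    single = hidden_states.mean(axis=0)
    single = single / np.linalg.norm(single)           # unit-normalized (assumption)
    # Multi-vector mode: project each token vector, keeping one vector per token.
    multi = hidden_states @ projection                 # (n_tokens, d_out)
    multi = multi / np.linalg.norm(multi, axis=1, keepdims=True)
    return single, multi
```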

For texts, its input is the same as in conventional text embedding models: Texts are tokenized, and each token is replaced by a vector from a lookup table. These token vectors together serve as input to the model.

The innovation of VLMs is in their handling of images: A conventional image embedding model is connected to the text embedding model, but instead of producing an image embedding by mean pooling, its final layer becomes the input to the text embedding model, as if it were just a sequence of token vectors.

Figure 6: How jina-embeddings-v4 processes images compared to text.

This means that large images have the same problems for embedding models that long texts have. Longer inputs increase computational costs and encoding times and often produce uninformative embeddings. Small images have the opposite problem: If too much information is packed into one patch, it gets attenuated in the embedding process and gets lost in single-vector embeddings, while for multi-vector embeddings, it reduces the strength of matches. There is a trade-off between the amount of information provided to the model and its ability to accurately extract relevant information.

But texts can’t be resized; images can.

Image Preprocessing in jina-embeddings-v4

When you submit an image to jina-embeddings-v4, it’s split into patches of 28x28 pixels. This means that the number of patches is roughly proportional to the size of the image. If you use the Jina API, the maximum size supported is 602,112 pixels, but it’s a user-settable parameter in the model code if you download it.

The VLM architecture means that the number of 28x28 image patches the model can support is equal to the number of tokens of text it can support. That places an absolute upper bound on image size at roughly 20 megapixels.
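As a rough back-of-envelope calculation (the exact count depends on the model's image processor), you can estimate how many patches, and therefore how many visual tokens, an image will consume:

```python
from math import ceil

def estimate_patches(width: int, height: int, patch: int = 28) -> int:
    # Each 28x28 block of pixels becomes roughly one visual token,
    # so the token count grows linearly with the pixel count.
    return ceil(width / patch) * ceil(height / patch)

# A 602,112-pixel image (e.g. 896x672) comes to about 768 patches,
# matching the 768-token figure for that resolution in the tables below.
print(estimate_patches(896, 672))   # -> 768
```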

Figure 7: How resolution affects the number of patches.

Experiments with Image Resolution

We evaluated different image resolutions using jina-embeddings-v4 with the ViDoRe v1 and v2 suites, and part of the JinaVDR benchmark suite, looking at both single-vector and multi-vector outputs. Adjusting the max_pixels parameter of the model to a range of values up to 19,267,584 pixels (i.e., 5,376x3,584) produces embeddings from different image resolutions. Where an image is already smaller than the value of max_pixels, it is not changed.
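If you want to reproduce that resizing step yourself, a minimal sketch with Pillow looks like the following; the helper name cap_resolution is ours, and the resampling details of the model's own preprocessor may differ:

```python
from PIL import Image

def cap_resolution(img: Image.Image, max_pixels: int = 602_112) -> Image.Image:
    """Downscale an image so that width * height <= max_pixels, preserving
    aspect ratio. Images already below the budget are returned unchanged,
    mirroring the behavior described above."""
    w, h = img.size
    if w * h <= max_pixels:
        return img
    scale = (max_pixels / (w * h)) ** 0.5
    new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
    return img.resize(new_size, Image.LANCZOS)
```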

ViDoRe v1

Table 1 reports single-vector (dense vector) embedding retrieval results (average nDCG@5 score) for six resolutions on the individual ViDoRe v1 benchmarks. We only report values up to 9,633,792 pixels because no images in ViDoRe v1 were larger than that.
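For reference, here is a minimal single-query sketch of nDCG@5 (evaluation harnesses may differ in details such as tie handling):

```python
import numpy as np

def ndcg_at_k(relevances, k: int = 5) -> float:
    """relevances: graded relevance labels of the retrieved documents in
    ranked order, e.g. [1, 0, 1, 0, 0] for binary judgments."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # 1 / log2(rank + 1)
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```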

Figure 8: Examples of documents from the ViDoRe v1 benchmark suite.
Table 1: Single-vector (dense) retrieval scores (average nDCG@5) on ViDoRe v1 benchmarks at different maximum resolutions.

| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| arxivqa_test_subsampled | 0.83487 | 0.84529 | 0.84537 | 0.83785 | 0.83439 | 0.83469 |
| docvqa_test_subsampled | 0.47366 | 0.50715 | 0.52421 | 0.51075 | 0.50287 | 0.50258 |
| infovqa_test_subsampled | 0.84404 | 0.87510 | 0.87890 | 0.87978 | 0.87672 | 0.87710 |
| shiftproject_test | 0.77524 | 0.81494 | 0.83988 | 0.84427 | 0.84127 | 0.84196 |
| syntheticDocQA_artificial_intelligence_test | 0.93809 | 0.96786 | 0.96655 | 0.97155 | 0.97024 | 0.97024 |
| syntheticDocQA_energy_test | 0.86865 | 0.89540 | 0.89847 | 0.91172 | 0.91286 | 0.91286 |
| syntheticDocQA_government_reports_test | 0.91708 | 0.93417 | 0.93865 | 0.92309 | 0.91609 | 0.91609 |
| syntheticDocQA_healthcare_industry_test | 0.93865 | 0.96428 | 0.96024 | 0.96542 | 0.95417 | 0.95286 |
| tabfquad_test_subsampled | 0.94298 | 0.94853 | 0.94502 | 0.94505 | 0.94505 | 0.94612 |
| tatdqa_test | 0.58622 | 0.64832 | 0.65867 | 0.65985 | 0.65395 | 0.65498 |
| AVERAGE | 0.81195 | 0.84010 | 0.84560 | 0.84493 | 0.84076 | 0.84095 |

When a higher resolution is tied with a lower one, it means there were no images in that benchmark larger than the lower resolution, so no resizing occurred.

You can see in Table 1 that the highest resolution is far from the best. 9.6 megapixels performed below the best-scoring resolution on every test except one, where no images were larger than 4.8 megapixels and it got the same score because there was no resizing. The best performer on average was 1.2 megapixels, and the resolution with the most best scores on individual benchmarks was 2.4 megapixels.

If we took the best score on each benchmark and averaged them, we would get 0.84905. This is only 0.345% better than the score for 1.2 megapixels.

Using multi-vector embeddings, as shown in Table 2, our retrieval scores are significantly higher. Multi-vector matching typically performs better than single-vector matching. However, the performance of different resolutions is roughly the same.

Table 2: Multi-vector (late interaction) retrieval scores (average nDCG@5) on ViDoRe v1 benchmarks at different maximum resolutions.

| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| arxivqa_test_subsampled | 0.87456 | 0.88881 | 0.88736 | 0.88531 | 0.88899 | 0.89052 |
| docvqa_test_subsampled | 0.55344 | 0.61284 | 0.61123 | 0.59941 | 0.59087 | 0.59229 |
| infovqa_test_subsampled | 0.88777 | 0.92646 | 0.93376 | 0.94007 | 0.93459 | 0.93533 |
| shiftproject_test | 0.86224 | 0.90563 | 0.93547 | 0.92847 | 0.92240 | 0.92240 |
| syntheticDocQA_artificial_intelligence_test | 0.99631 | 0.99131 | 0.99500 | 0.99262 | 0.99262 | 0.99262 |
| syntheticDocQA_energy_test | 0.95216 | 0.96524 | 0.96524 | 0.96893 | 0.96762 | 0.96762 |
| syntheticDocQA_government_reports_test | 0.95934 | 0.97085 | 0.97524 | 0.98024 | 0.96655 | 0.96655 |
| syntheticDocQA_healthcare_industry_test | 0.97893 | 0.97893 | 0.99631 | 0.98524 | 0.98393 | 0.98393 |
| tabfquad_test_subsampled | 0.95386 | 0.95732 | 0.95611 | 0.95379 | 0.95379 | 0.95379 |
| tatdqa_test | 0.70547 | 0.78534 | 0.79516 | 0.80422 | 0.80552 | 0.80727 |
| AVERAGE | 0.87241 | 0.89827 | 0.90509 | 0.90383 | 0.90069 | 0.90123 |

Just like for single-vector embeddings, 1.2 megapixels resolution has the highest average score, and 2.4 megapixels is the best scoring resolution for the largest number of individual datasets. The average of best scores over all benchmarks is 0.90853, and the 1.2 megapixel resolution’s average score is 0.344% lower, almost exactly the same as for the single-vector case.

ViDoRe v2

ViDoRe v2 uses much more detailed and colorful imagery than ViDoRe v1, which suggests that the optimal resolution will likely be higher.

Figure 9: Examples of ViDoRe v2 documents.

We tested multi-vector embeddings from jina-embeddings-v4 on the ViDoRe v2 benchmark suite, with results in Table 3. Single-vector results are similar but with lower scores, so we omit them here.

Table 3: Multi-vector retrieval scores (average nDCG@5) on ViDoRe v2 benchmarks at different maximum resolutions.

| Benchmark Dataset | 150,528 px | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px |
|---|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 192 | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 |
| esg_reports_v2 | 0.40444 | 0.54013 | 0.52005 | 0.51916 | 0.49953 | 0.52664 | 0.51442 |
| biomedical_lectures_v2 | 0.58760 | 0.60479 | 0.61840 | 0.60748 | 0.60748 | 0.60748 | 0.60748 |
| economics_reports_v2 | 0.47666 | 0.50399 | 0.54216 | 0.54998 | 0.54998 | 0.54998 | 0.54998 |
| esg_reports_human_labeled_v2 | 0.42171 | 0.56940 | 0.61227 | 0.57307 | 0.61108 | 0.63858 | 0.64921 |
| AVERAGE | 0.47260 | 0.55458 | 0.57322 | 0.56242 | 0.56702 | 0.58067 | 0.58027 |

In this case, the best performer on average is the 4.8 megapixel resolution, with average performance at 9.6 megapixels essentially identical. Nonetheless, the highest resolution failed to beat lower resolutions on three of the four benchmarks; on one of those, no images were larger than 4.8 megapixels, so it merely tied the lower resolution.

High-Resolution Benchmarks

ViDoRe v1 and ViDoRe v2 consist of images with similar native resolutions, so we took two benchmarks from the JinaVDR suite that contain very high-resolution, difficult-to-process images, and performed the same multi-vector embeddings test with them.

One is the europeana-de-news benchmark, which contains high-resolution scans of German newspapers from the 17th to 20th centuries; the other is the wikimedia-commons-maps benchmark, which contains very-high-resolution scans of printed maps, mostly from the pre-digital era.

Figure 10: (Left) A 4000x2968 pixel map from an 18th-century print, included in wikimedia-commons-maps. (Right) Front page of the Hamburger Nachrichten newspaper, dated 26 April 1819 and included in europeana-de-news, scanned to 4324x4738 pixels.

The results are very mixed, as you can see in Table 4.

Table 4: Multi-vector retrieval scores (average nDCG@5) on two high-resolution JinaVDR benchmarks.

| Benchmark Dataset | 301,056 px | 602,112 px | 1,204,224 px | 2,408,448 px | 4,816,896 px | 9,633,792 px | 19,267,584 px |
|---|---|---|---|---|---|---|---|
| Max embedding vector size in tokens | 384 | 768 | 1,536 | 3,072 | 6,144 | 12,288 | 24,576 |
| europeana-de-news | 0.46319 | 0.59457 | 0.66802 | 0.66550 | 0.66407 | 0.63948 | 0.65208 |
| wikimedia-commons-maps | 0.23671 | 0.34421 | 0.42835 | 0.52268 | 0.53464 | 0.53464 | 0.53588 |
| AVERAGE | 0.34995 | 0.46939 | 0.54819 | 0.59409 | 0.59936 | 0.58706 | 0.59398 |

The newspaper data clearly does not require the same resolution as the map data. This is likely because the text in the maps is very small compared to the overall image size and it really needs to be readable for retrieval to work. When the newspaper scans are sized larger than optimal, performance fluctuates and declines.

The resolution that is best on average is suboptimal for both datasets individually, suggesting that there is no single right resolution that solves the problem. That discovery motivates what we did next.

Multi-Resolution Embeddings

The VLM architecture of jina-embeddings-v4 treats individual patches of images the same way it treats text tokens, so it's easy for us to augment high-resolution patches with low-resolution ones. It just means more input data, so long as we stay within the maximum that the model supports. If lower resolutions yield better semantics than higher ones, we can simply include both and not worry about what the optimal resolution is.

To test this hypothesis, we looked at three combinations of resolutions:

Table 5: The three combinations of resolutions tested.

|  | Mix 1 | Mix 2 | Mix 3 |
|---|---|---|---|
| Maximum number of total tokens | 2,880 | 5,234 | 12,096 |
| Resolutions | 150,528 px<br>301,056 px<br>602,112 px<br>1,204,224 px | 50,000 px<br>90,000 px<br>160,000 px<br>250,000 px<br>360,000 px<br>490,000 px<br>602,112 px<br>900,000 px<br>1,204,224 px | 150,528 px<br>301,056 px<br>602,112 px<br>1,204,224 px<br>2,408,448 px<br>4,816,896 px |

This means that when we tested Mix 1, we resized each image to four different resolutions — 150,528 px, 301,056 px, 602,112 px, 1,204,224 px — and processed each of them individually into multi-vector embeddings, then concatenated the results before doing query matching. Same for Mix 2 and Mix 3, but at different resolutions. We tried all three combinations on the ViDoRe v2 benchmarks, summarized in Table 6.
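As a rough sketch of how this could work in practice: embed_fn below is a stand-in for whatever call returns the per-patch (multi-vector) embedding of one image, cap_resolution is the helper sketched earlier, and matching uses ColBERT-style MaxSim scoring over the concatenated vectors:

```python
import numpy as np
from PIL import Image

MIX_1 = [150_528, 301_056, 602_112, 1_204_224]    # pixel budgets for Mix 1

def embed_multi_resolution(img: Image.Image, embed_fn, budgets=MIX_1) -> np.ndarray:
    """Embed one document image at several resolutions and concatenate the
    per-patch embeddings. embed_fn is a placeholder returning an
    (n_patches, dim) array for a single image."""
    parts = [embed_fn(cap_resolution(img, max_pixels)) for max_pixels in budgets]
    return np.concatenate(parts, axis=0)           # (total_patches, dim)

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction scoring: each query vector is matched to its best
    document vector, and the per-query best similarities are summed."""
    sims = query_vecs @ doc_vecs.T                 # (n_query, n_doc)
    return float(sims.max(axis=1).sum())
```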

Table 6: Multi-vector retrieval scores (average nDCG@5) on ViDoRe v2 benchmarks for the three resolution mixes, compared with the best single resolution.

| Benchmark Dataset | Best Single Resolution | Mix 1 | Mix 2 | Mix 3 |
|---|---|---|---|---|
| Max embedding vector size in tokens | — | 2,880 | 5,234 | 12,096 |
| esg_reports_v2 | 0.54013 | 0.58354 | 0.59252 | 0.56567 |
| biomedical_lectures_v2 | 0.61840 | 0.61678 | 0.61714 | 0.61638 |
| economics_reports_v2 | 0.54998 | 0.54997 | 0.55534 | 0.55049 |
| esg_reports_human_labeled_v2 | 0.64921 | 0.67726 | 0.68057 | 0.66734 |
| AVERAGE | 0.58943 | 0.60689 | 0.61139 | 0.59997 |

For all ViDoRe v2 benchmarks, Mix 2 outperforms Mixes 1 and 3, suggesting that a wider spread of resolutions produces better results than adding higher resolutions. On two of the four benchmarks, all mixtures of resolutions outperform the best single resolution; on the remaining two, some mixtures fall short of the best single resolution, but never by a large margin.

These results show that although there is no single best mix of resolutions for all data, identifying a good mix of resolutions is the right direction to go to find an optimal solution.

Conclusion

Image resolution matters quite a lot for jina-embeddings-v4 when processing visually rich materials. One key issue is that text should be sized to be readable. What you can’t read, AI can’t read either, and if the text matters, then it has to be read.

But, too high a resolution and it becomes difficult for the embedding model to assemble image patches into a coherent whole. It’s also expensive: Higher resolutions mean more processing and, for multi-vector embeddings, more storage and slower matching.

Using multiple resolutions and applying late interaction-style scoring on all the outputs together is a good way to handle visually rich images with varying sizes. But, this adds to processing and storage costs, and slows down retrieval the same way that very large resolutions do.

We're looking into ways to operationalize this insight to improve neural search. We constantly work to diversify our training and test data to probe our models for shortcomings and devise ways to improve them, and we are studying the effects of resolution and multi-resolution techniques like the ones described here on various types of materials.

We are also running experiments to see whether multi-resolution techniques mitigate the effect of noise in images and produce more robust retrieval.

Furthermore, it may be possible to automatically determine in advance the optimal resolution for each image. If we can reliably detect the best resolution, it would remove one more parameter for users while improving overall results, making embeddings more accessible and usable.
