Overview
jina-vlm is a 2.4B-parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2-So400M vision encoder (449M parameters) with a Qwen3-1.7B language backbone through an attention-pooling connector that reduces visual tokens by 4× while preserving spatial information. Using overlapping image tiling with up to 12 tiles plus a global thumbnail, it processes images of arbitrary resolution up to 4K. Training data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning high- and moderate-resource languages including Chinese, Arabic, German, Spanish, French, Italian, Japanese, and Korean, among others.
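To make the tiling concrete, here is a minimal sketch that builds an overlapping tile grid for an arbitrary-resolution image, using the 378×378 tile size and 50% overlap described under Methods and capping the grid at 12 tiles plus a global thumbnail. The shrink-to-fit loop and PIL-based cropping are illustrative assumptions, not the released preprocessing code.

```python
# Sketch of overlapping tiling: 378x378 tiles, 50% overlap, at most 12 tiles
# plus one global thumbnail. The shrink-to-fit policy below is an assumption.
from PIL import Image

TILE = 378
STRIDE = TILE // 2  # 50% overlap between neighbouring tiles


def tile_positions(width: int, height: int) -> list[tuple[int, int]]:
    """Top-left corners of an overlapping tile grid."""
    xs = range(0, max(width - TILE, 0) + 1, STRIDE)
    ys = range(0, max(height - TILE, 0) + 1, STRIDE)
    return [(x, y) for y in ys for x in xs]


def tile_image(img: Image.Image, max_tiles: int = 12) -> list[Image.Image]:
    """Return up to `max_tiles` overlapping crops plus one global thumbnail."""
    # Downscale until the overlapping grid fits the tile budget (assumed policy).
    while len(tile_positions(*img.size)) > max_tiles:
        img = img.resize((int(img.width * 0.9), int(img.height * 0.9)))
    crops = [img.crop((x, y, x + TILE, y + TILE))
             for x, y in tile_positions(*img.size)]
    thumbnail = img.resize((TILE, TILE))  # low-resolution global view
    return crops + [thumbnail]
```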
Methods
Training proceeds in two stages with all model components (encoder, connector, decoder) updated; nothing is frozen. Stage 1 (alignment training) focuses on cross-language semantic grounding using caption datasets (PixmoCap, PangeaIns) spanning natural scenes, documents, infographics, and diagrams, with 15% text-only data to mitigate degradation on text-only tasks. The connector uses a higher learning rate and a shorter warmup than the encoder and decoder. Stage 2 (instruction tuning) adapts the model to conversational VQA using multilingual instruction-response datasets (Aya, ShareGPT4V, LLaVA). The attention-pooling connector applies 2×2 pooling to reduce the 729 visual tokens per tile to 182, a 4× token reduction with minimal performance loss; a sketch of the mechanism follows below. Overlapping 378×378 tiles with 50% overlap preserve information at tile edges.
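The attention-pooling step can be sketched as follows: a learned query attends over each non-overlapping 2×2 window of visual tokens and emits a single pooled token. This illustrates the mechanism, not the released connector; in particular, the exact layout that maps 729 tokens per tile to 182 is not reproduced here (padding a 27×27 grid as below yields 196 pooled tokens), and the 1152-dimensional width is only assumed to match SigLIP2-So400M.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool2x2(nn.Module):
    """Collapse each 2x2 window of visual tokens into one token with a learned
    query (a sketch of the idea, not jina-vlm's actual connector)."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        b, _, d = tokens.shape
        x = tokens.view(b, grid, grid, d)
        pad = grid % 2                      # pad odd-sided grids to even size
        x = F.pad(x, (0, 0, 0, pad, 0, pad))
        g = grid + pad
        # Regroup into non-overlapping 2x2 windows: (b * n_windows, 4, dim).
        x = x.view(b, g // 2, 2, g // 2, 2, d).permute(0, 1, 3, 2, 4, 5)
        windows = x.reshape(-1, 4, d)
        # One learned query attends over the four tokens of each window.
        q = self.query.expand(windows.size(0), -1, -1)
        attn = torch.softmax(q @ self.key(windows).transpose(1, 2) / d**0.5, dim=-1)
        pooled = attn @ self.value(windows)          # (b * n_windows, 1, dim)
        return pooled.view(b, (g // 2) ** 2, d)      # 4x fewer visual tokens


# A 27x27 SigLIP2 grid (729 tokens per tile) pools to 14x14 = 196 tokens here.
pool = AttentionPool2x2(dim=1152)
print(pool(torch.randn(2, 729, 1152), grid=27).shape)  # torch.Size([2, 196, 1152])
```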
Performance
Achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs: MathVista (59.4), AI2D (80.8), ChartQA (79.5), DocVQA (88.9), InfoVQA (65.9), RealWorldQA (64.9), OCRBench (759 on a 0-1000 scale), and MME (1582). Leads on multilingual multimodal understanding with MMMB (78.8) and Multilingual MMBench (74.3), covering Arabic, Chinese, English, Portuguese, Russian, and Turkish. Text-only performance remains competitive on MMLU (54.7) and HellaSwag (75.6), though the model shows the expected degradation on MMLU-Pro (30.3 vs. 46.4 for the Qwen3-1.7B base) from vision-language integration. The 4× token reduction from attention pooling yields a 3.9× reduction in LLM prefill FLOPs and a 4× reduction in KV-cache memory with minimal impact on benchmark scores.
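For intuition on these efficiency numbers, the back-of-the-envelope sketch below scales prefill FLOPs and KV-cache size linearly with visual token count; the decoder dimensions are placeholder assumptions rather than Qwen3-1.7B's actual configuration, so only the ratios are meaningful. The raw token ratio is 729/182 ≈ 4.0×, and the slightly lower reported 3.9× for prefill presumably reflects the attention term and the text tokens, which do not shrink with pooling.

```python
# Back-of-the-envelope scaling of prefill cost and KV-cache memory with the
# number of visual tokens. Layer/head counts are illustrative placeholders.
def prefill_flops(n_tokens: int, n_params: float = 1.7e9) -> float:
    # Rough rule of thumb: ~2 * parameters FLOPs per token in a forward pass.
    return 2 * n_params * n_tokens


def kv_cache_bytes(n_tokens: int, layers: int = 28, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    # Keys and values cached per layer and token, stored in bfloat16 (2 bytes).
    return 2 * layers * kv_heads * head_dim * bytes_per_value * n_tokens


tokens_unpooled = 12 * 729   # 12 tiles, no pooling
tokens_pooled = 12 * 182     # after 2x2 attention pooling
print(prefill_flops(tokens_unpooled) / prefill_flops(tokens_pooled))    # ~4.0
print(kv_cache_bytes(tokens_unpooled) / kv_cache_bytes(tokens_pooled))  # ~4.0
```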
Best Practice
The model is available on Hugging Face under a CC-BY-NC-4.0 license with weights and inference code. It supports images of arbitrary resolution through automatic tiling (up to 12 tiles plus a thumbnail). For complex reasoning tasks, enable thinking mode by setting do_sample=True and a temperature above 0. The model handles a 32K context length for extended conversations. For multilingual VQA, it supports 29 languages including English, Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Turkish, Vietnamese, Thai, Indonesian, Hindi, and Bengali. It is best suited for document understanding, chart and diagram analysis, OCR tasks, and multilingual visual question answering. The model shows limitations on counting tasks and fine-grained spatial reasoning due to the tiling approach. For optimal inference, use bfloat16 precision on CUDA-capable GPUs.
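A hypothetical end-to-end example under these recommendations is sketched below: bfloat16 on a CUDA GPU, with sampling enabled so thinking mode is active. The repository id, processor class, message format, and generation settings are assumptions for illustration; the Hugging Face model card documents the actual interface.

```python
# Hypothetical usage sketch. The repo id "jinaai/jina-vlm", the processor
# class, and the chat format are assumptions; see the model card for the
# actual API and prompt conventions.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "jinaai/jina-vlm"  # assumed repository id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # recommended precision
    device_map="cuda",
    trust_remote_code=True,
)

image = Image.open("chart.png")
messages = [{"role": "user", "content": "What is the highest value in this chart?"}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# do_sample=True with temperature > 0 enables thinking mode.
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```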