jina-vlm

Multilingual vision-language model for visual question answering
License: CC-BY-NC-4.0
Release Date: 2025-12-04
Input: Image, Text
Output: Text
Model Details
Parameters: 2.4B
Input Token Length: 32K
Input Image Size: 4096×4096
Language Support
Multilingual (29 languages)
Related Models
jina-embeddings-v4
jina-reranker-m0
Tags
reader
vlm
multilingual
vision-language
image-to-text
document-processing
ocr
Available via
Commercial License, Hugging Face
Publications (1)
arXiv · December 04, 2025
Jina-VLM: Small Multilingual Vision Language Model

Overview

jina-vlm is a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2-So400M vision encoder (449M parameters) with a Qwen3-1.7B language backbone through an attention-pooling connector that reduces visual tokens by 4× while preserving spatial information. Using overlapping image tiling with 12 tiles plus a global thumbnail, it processes images of arbitrary resolution up to 4K. Training data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning high- and moderate-resource languages including Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, and more.
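
As a back-of-envelope check on this token budget, the figures above compose as follows. This is a hedged sketch in Python: the 378×378 tile size is stated in the Methods section below, and the 14×14 patch size is an assumption about the SigLIP2-So400M encoder.

```python
# Hedged arithmetic for jina-vlm's visual token budget.
# Assumption: 14x14 patches, so a 378x378 tile yields (378/14)^2 = 729 tokens.
TILE, PATCH = 378, 14
patches_per_tile = (TILE // PATCH) ** 2   # 27 * 27 = 729
pooled_per_tile = patches_per_tile // 4   # 182 after the 4x attention pooling
tiles = 12 + 1                            # 12 overlapping tiles + global thumbnail
print(pooled_per_tile * tiles)            # 2366 visual tokens for a max-size image
```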

Methods

Training proceeds in two stages, with all model components (encoder, connector, decoder) updated and none frozen. Stage 1 (alignment training) focuses on cross-language semantic grounding using caption datasets (PixmoCap, PangeaIns) spanning natural scenes, documents, infographics, and diagrams, with 15% text-only data to mitigate degradation on text-only tasks. The connector uses a higher learning rate and a shorter warmup than the encoder and decoder. Stage 2 (instruction tuning) adapts the model to conversational VQA using multilingual instruction-response datasets (Aya, ShareGPT4V, LLaVA). The attention-pooling connector applies 2×2 pooling to reduce the 729 visual tokens per tile to 182, a 4× token reduction with minimal performance loss, as sketched below. Overlapping tiling with 50% overlap and 378×378 tiles preserves edge information.
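
To make the connector concrete, here is a minimal sketch of 2×2 attention pooling in PyTorch: one learned query attends over each 2×2 window of visual tokens, halving each spatial side. It is an illustration under assumptions (single attention head, even grid dimensions; the class and parameter names are hypothetical), not the released jina-vlm connector, whose 27×27 tile grids would additionally need padding or ragged windows.

```python
import torch
import torch.nn as nn

class AttentionPool2x2(nn.Module):
    """Sketch: one learned query attends over each 2x2 window of visual
    tokens, halving each spatial side (4x token reduction). Hypothetical
    module, not the released jina-vlm connector."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, h*w, D) grid of visual tokens; h and w assumed even
        B, _, D = tokens.shape
        x = tokens.view(B, h // 2, 2, w // 2, 2, D)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, 4, D)   # (B*windows, 4, D)
        q = self.query.expand(x.size(0), 1, D)              # one query per window
        pooled, _ = self.attn(q, x, x)                      # (B*windows, 1, D)
        return pooled.reshape(B, (h // 2) * (w // 2), D)    # 4x fewer tokens

# e.g. a 28x28 (padded) grid of 784 tokens pools down to 196
pool = AttentionPool2x2(dim=64)
out = pool(torch.randn(2, 28 * 28, 64), h=28, w=28)
print(out.shape)  # torch.Size([2, 196, 64])
```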

Performance

jina-vlm achieves the highest average score (72.3) across eight VQA benchmarks among 2B-scale VLMs, including MathVista (59.4), AI2D (80.8), ChartQA (79.5), DocVQA (88.9), InfoVQA (65.9), RealWorldQA (64.9), OCRBench (759 on a 0-1000 scale), and MME (1582). It leads on multilingual multimodal understanding with MMMB (78.8) and Multilingual MMBench (74.3), covering Arabic, Chinese, English, Portuguese, Russian, and Turkish. Text-only performance remains competitive on MMLU (54.7) and HellaSwag (75.6), though the model shows expected degradation on MMLU-Pro (30.3 vs. 46.4 for the base model) due to vision-language integration. The 4× token reduction from attention pooling yields a 3.9× reduction in LLM prefill FLOPs and a 4× reduction in KV-cache memory with minimal impact on benchmark scores.
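
The gap between the 4× token reduction and the 3.9× prefill saving is what one would expect when a fixed text prompt shares the prefill sequence with the pooled visual tokens. A toy calculation with assumed token counts (illustrative only, not figures from the paper):

```python
# Toy illustration (assumed numbers): prefill cost taken as roughly linear
# in sequence length, with a fixed text prompt alongside the visual tokens.
# The saving approaches 4x as visual tokens dominate the sequence.
visual_pooled, text = 2366, 50    # hypothetical token counts
visual_raw = visual_pooled * 4
speedup = (visual_raw + text) / (visual_pooled + text)
print(round(speedup, 2))          # ~3.94x
```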

Best Practice

The model is available on Hugging Face under the CC-BY-NC-4.0 license with weights and inference code. It supports images of arbitrary resolution through automatic tiling (up to 12 tiles plus a thumbnail) and handles a 32K context length for extended conversations. For complex reasoning tasks, enable thinking mode by setting do_sample=True and temperature > 0. For multilingual VQA, the model supports 29 languages including English, Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Turkish, Vietnamese, Thai, Indonesian, Hindi, and Bengali. It is best suited for document understanding, chart/diagram analysis, OCR tasks, and multilingual visual question answering; it shows limitations on counting tasks and fine-grained spatial reasoning due to its tiling approach. For optimal inference, use bfloat16 precision on CUDA-capable GPUs.
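
A minimal usage sketch with the Hugging Face transformers library, applying the recommendations above (bfloat16 on CUDA, sampling enabled for thinking mode). The repo id jinaai/jina-vlm and the AutoProcessor/generate interface are assumptions; consult the model card for the exact API:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "jinaai/jina-vlm"  # assumed repo id; check the Hugging Face page
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # recommended precision
    device_map="cuda",
    trust_remote_code=True,
)

image = Image.open("chart.png")   # any resolution up to 4096x4096
inputs = processor(
    text="What does this chart show?", images=image, return_tensors="pt"
).to(model.device)

# do_sample=True with temperature > 0 enables the model's thinking mode
out = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```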
Blogs that mention this model

December 04, 2025 · 6 minutes read
Jina-VLM: Small Multilingual Vision Language Model