Featured
Press release
December 04, 2025

Jina-VLM: Small Multilingual Vision Language Model

New 2B vision language model achieves SOTA on multilingual VQA, no catastrophic forgetting on text-only tasks.
Jina AI • 6 minutes read
Paper: Jina-VLM: Small Multilingual Vision Language Model (arXiv.org · Andreas Koukounas)
Abstract: We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
Model: jinaai/jina-vlm · Hugging Face

We're releasing jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. By combining a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector, jina-vlm delivers strong performance across 29 languages while remaining efficient enough to run on consumer hardware.

| Model | Size | VQA Avg | MMMB | Multi. MMB | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| jina-vlm | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |
  • Performance distribution across 6 languages on the MMMB benchmark: Arabic, Chinese, English, Portuguese, Russian, and Turkish. MMMB evaluates multilingual multimodal understanding through diverse visual question types.
  • Performance distribution across 6 languages on Multilingual MMBench: Arabic, Chinese, English, Portuguese, Russian, and Turkish. This benchmark tests cross-lingual visual reasoning and perception abilities.
  • Performance distribution across 8 visual question answering benchmarks: AI2D (diagrams), ChartQA (charts), TextVQA (scene text), DocVQA (documents), InfoVQA (infographics), OCRBench (OCR), SEED-2-Plus (diverse scenes), and CharXiv (scientific figures).
  • Performance distribution across 3 real-world understanding benchmarks: RealWorldQA (practical scenarios), MME-RealWorld (real-world perception), and R-Bench (robustness evaluation).
  • Performance distribution across 5 text-only benchmarks comparing jina-vlm against its Qwen3-1.7B backbone: MMLU (knowledge), MMLU-Pro (advanced reasoning), GSM-8K (math), ARC-C (science), and HellaSwag (commonsense).

Architecture

Architecture of jina-vlm. Images are resized to fit a grid of up to 12 overlapping tiles, plus a global thumbnail. Each tile is a square 378x378 crop; adjacent tiles overlap by 112 pixels with a stride of 266 pixels between tile origins. A 4x3 grid therefore spans 1176x910 pixels, and images exceeding this effective resolution are downscaled to fit the tile budget. Each tile produces 729 patches via SigLIP2. The VL connector concatenates features from layers 24 and 18, the third- and ninth-to-last layers, then applies 2x2 attention pooling to reduce 729 tokens to 182 before projecting to the decoder dimension. Visual tokens are combined with text embeddings for the Qwen3 decoder.
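
To make the tiling arithmetic concrete, the short sketch below reproduces the numbers in the caption from the stated tile size (378 px) and stride (266 px); it is an illustration, not the actual preprocessing code.

# Tiling arithmetic from the architecture description (illustrative only).
TILE = 378               # square tile side in pixels
STRIDE = 266             # distance between origins of adjacent tiles
OVERLAP = TILE - STRIDE  # 112 px of overlap between adjacent tiles

def grid_span(cols: int, rows: int) -> tuple[int, int]:
    """Pixel extent covered by a cols x rows grid of overlapping tiles."""
    return TILE + (cols - 1) * STRIDE, TILE + (rows - 1) * STRIDE

print(OVERLAP)          # 112
print(grid_span(4, 3))  # (1176, 910), the 4x3 grid mentioned in the caption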

Two challenges have limited practical VLM deployment: multilingual capabilities often degrade during vision adaptation, and high-quality VLMs remain computationally expensive. jina-vlm addresses both through careful architectural choices—our attention-pooling connector reduces visual tokens by 4× with minimal performance impact—and a training recipe that explicitly preserves multilingual capabilities.

The key architectural innovation is our vision-language connector. Rather than passing all 729 visual tokens per tile to the language model, we apply 2×2 attention pooling that reduces this to 182 tokens—a 4× reduction with minimal information loss. The connector works as follows:

  1. Multi-layer feature fusion: We concatenate features from ViT layers 18 and 24 (the ninth- and third-to-last layers), capturing both fine-grained spatial details and high-level semantics.
  2. Attention pooling: For each 2×2 patch neighborhood, we compute a query as the mean of neighborhood features, then apply cross-attention to produce a single pooled representation.
  3. SwiGLU projection: The pooled features are projected to the language model dimension via a gated linear unit (a minimal sketch of the full connector follows this list).
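
Below is an illustrative PyTorch sketch of such a connector. It is not the released implementation: the feature dimension, head count, and SwiGLU shape are placeholder assumptions, and it assumes an even patch grid for simplicity (the actual connector reduces each tile's 729 patches to 182 tokens).

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPoolConnector(nn.Module):
    """Illustrative sketch of the connector described above; dimensions are placeholders."""

    def __init__(self, vit_dim: int = 1152, llm_dim: int = 2048, num_heads: int = 8):
        super().__init__()
        in_dim = vit_dim * 2  # features from two ViT layers, concatenated
        self.attn = nn.MultiheadAttention(in_dim, num_heads, batch_first=True)
        # SwiGLU-style gated projection into the language-model dimension
        self.gate = nn.Linear(in_dim, llm_dim, bias=False)
        self.up = nn.Linear(in_dim, llm_dim, bias=False)
        self.down = nn.Linear(llm_dim, llm_dim, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H, W, 2*vit_dim); H and W assumed even for this sketch
        B, H, W, D = feats.shape
        # Group patches into non-overlapping 2x2 neighborhoods -> (B*H/2*W/2, 4, D)
        windows = feats.unfold(1, 2, 2).unfold(2, 2, 2)       # (B, H/2, W/2, D, 2, 2)
        windows = windows.permute(0, 1, 2, 4, 5, 3).reshape(-1, 4, D)
        # Query is the mean of each neighborhood; cross-attend over its 4 patches
        query = windows.mean(dim=1, keepdim=True)
        pooled, _ = self.attn(query, windows, windows)        # (N, 1, D)
        pooled = pooled.reshape(B, -1, D)                     # (B, H/2 * W/2, D)
        return self.down(F.silu(self.gate(pooled)) * self.up(pooled))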

The table below quantifies the efficiency gain for a full 12-tile image plus the global thumbnail:

| Metric | No Pooling | With Pooling | Reduction |
|---|---|---|---|
| Visual tokens (12 tiles + thumbnail) | 9,477 | 2,366 | 4.0× |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9× |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0× |
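
The visual-token row follows directly from the tiling description: 13 crops (12 tiles plus the global thumbnail) at 729 patches each without pooling, or 182 tokens each with pooling.

crops = 12 + 1                 # 12 tiles plus the global thumbnail
tokens_unpooled = crops * 729  # 9,477 visual tokens
tokens_pooled = crops * 182    # 2,366 visual tokens
print(tokens_unpooled, tokens_pooled, round(tokens_unpooled / tokens_pooled, 1))
# 9477 2366 4.0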

Since the ViT processes each tile identically regardless of pooling, these savings apply exclusively to the language model, which is the dominant cost during inference.

Training Procedure

A common failure mode in VLM training is catastrophic forgetting: the language model loses its text-only capabilities as it adapts to visual inputs. This is particularly acute for multilingual models, where vision adaptation can degrade performance on non-English languages.

We address this through a two-stage training pipeline with explicit multilingual data and text-only preservation.

Stage 1: Alignment Training

The first stage focuses on cross-language semantic grounding using caption datasets spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. Crucially, we include 15% text-only data to maintain the backbone's language understanding. The connector uses a higher learning rate (2e-4) and shorter warmup than the encoder and decoder, allowing it to adapt quickly while the pretrained components change gradually.
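One way to realize such a differential learning-rate setup is through optimizer parameter groups. The sketch below is a generic illustration, not the actual training configuration: the module names and the encoder/decoder rate are placeholders, only the connector's 2e-4 comes from the text, and the per-group warmup schedule is omitted.

import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Toy stand-in for the three trainable components (hypothetical names)."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.connector = nn.Linear(8, 8)
        self.language_model = nn.Linear(8, 8)

model = ToyVLM()

# The connector gets a higher learning rate so it adapts quickly while the
# pretrained encoder and decoder change gradually.
optimizer = torch.optim.AdamW([
    {"params": model.vision_encoder.parameters(), "lr": 2e-5},  # placeholder
    {"params": model.language_model.parameters(), "lr": 2e-5},  # placeholder
    {"params": model.connector.parameters(),      "lr": 2e-4},  # value from the text
])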

Stage 2: Instruction Fine-tuning

The second stage trains instruction-following for VQA and reasoning tasks. We combine public datasets covering academic VQA, document understanding, OCR, mathematics, and reasoning, with text-only instruction data to maintain language capabilities.

The combined data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Turkish, Vietnamese, Thai, Indonesian, Hindi, Bengali, and others.

Getting Started

Via CLI

The HuggingFace repository includes an infer.py script for quick experiments:

# Single image
python infer.py -i image.jpg -p "What's in this image?"

# Streaming output
python infer.py -i image.jpg -p "Describe this image" --stream

# Multiple images
python infer.py -i img1.jpg -i img2.jpg -p "Compare these images"

# Text-only
python infer.py -p "What is the capital of France?"

Via Transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load model and processor
model = AutoModelForCausalLM.from_pretrained(
    "jinaai/jina-vlm",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(
    "jinaai/jina-vlm",
    trust_remote_code=True
)

# Load an image
image = Image.open("document.png")

# Create the conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the main topic of this document?"}
        ]
    }
]

# Process and generate
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt
response = processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
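
For streaming output, analogous to the CLI --stream flag, the standard transformers TextStreamer can be passed to generate. This is a generic transformers pattern rather than anything specific to jina-vlm, and it assumes the processor exposes its tokenizer as processor.tokenizer, as most multimodal processors do.

from transformers import TextStreamer

# Reuses `model`, `processor`, and `inputs` from the example above and
# prints tokens to stdout as they are generated.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=256, do_sample=False, streamer=streamer)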

Conclusion

jina-vlm demonstrates that small VLMs can achieve strong cross-lingual visual understanding through careful architectural and training choices. The attention-pooling connector provides 4× token reduction with minimal performance impact, and incorporating text-only data during multimodal training preserves language capabilities that would otherwise degrade.

We note several limitations of the current approach:

  • Tiling overhead: Processing scales linearly with tile count, which can become significant for very high-resolution images. In addition, tiling can impair tasks requiring holistic scene understanding, such as object counting or spatial reasoning across tile boundaries. The global thumbnail partially mitigates this, but native-resolution approaches may be better suited for such tasks.
  • Multi-image reasoning: Performance on multi-image benchmarks is weaker due to limited training data in this regime. Optimizing for concise visual responses appears to conflict with extended multi-step reasoning, as evidenced by MMLU-Pro degradation.

Future work could explore more efficient resolution handling and targeted improvements for counting and spatial tasks, and investigate whether our multilingual training recipe transfers to larger model scales.
