

We're releasing jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. By combining a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector, jina-vlm delivers strong performance across 29 languages while remaining efficient enough to run on consumer hardware.
| Model | Size | VQA Avg | MMMB | Multilingual MMBench | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| jina-vlm | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |





## Architecture
Two challenges have limited practical VLM deployment: multilingual capabilities often degrade during vision adaptation, and high-quality VLMs remain computationally expensive. jina-vlm addresses both through careful architectural choices—our attention-pooling connector reduces visual tokens by 4× with minimal performance impact—and a training recipe that explicitly preserves multilingual capabilities.
The key architectural innovation is our vision-language connector. Rather than passing all 729 visual tokens per tile to the language model, we apply 2×2 attention pooling that reduces this to 182 tokens—a 4× reduction with minimal information loss. The connector works as follows:
- Multi-layer feature fusion: We concatenate features from ViT layers 18 and 24 (ninth-to-last and third-to-last, respectively), capturing both fine-grained spatial details and high-level semantics.
- Attention pooling: For each 2×2 patch neighborhood, we compute a query as the mean of neighborhood features, then apply cross-attention to produce a single pooled representation.
- SwiGLU projection: The pooled features are projected to the language model dimension via a gated linear unit (see the sketch after this list).
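For concreteness, here is a minimal PyTorch sketch of such a connector under assumed dimensions (1152-dim ViT features, a 2048-dim language model, an even patch grid, 8 attention heads). The class, parameter names, and shapes are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPoolingConnector(nn.Module):
    """Sketch: fuse two ViT layers, pool each 2x2 neighborhood via
    cross-attention (query = neighborhood mean), project with a
    SwiGLU-style gated unit. All dimensions are illustrative."""

    def __init__(self, vit_dim: int = 1152, llm_dim: int = 2048, num_heads: int = 8):
        super().__init__()
        fused_dim = 2 * vit_dim  # concatenated features from two ViT layers
        self.pool_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.gate_proj = nn.Linear(fused_dim, llm_dim, bias=False)
        self.up_proj = nn.Linear(fused_dim, llm_dim, bias=False)
        self.down_proj = nn.Linear(llm_dim, llm_dim, bias=False)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (B, H, W, vit_dim) features from two ViT layers;
        # H and W are assumed even here for simplicity.
        x = torch.cat([feats_a, feats_b], dim=-1)  # (B, H, W, 2*vit_dim)
        B, H, W, D = x.shape
        # Group patches into 2x2 neighborhoods: (B * H/2 * W/2, 4, D)
        x = x.view(B, H // 2, 2, W // 2, 2, D).permute(0, 1, 3, 2, 4, 5).reshape(-1, 4, D)
        query = x.mean(dim=1, keepdim=True)      # neighborhood mean as query, (N, 1, D)
        pooled, _ = self.pool_attn(query, x, x)  # cross-attention over the 4 patches
        pooled = pooled.reshape(B, (H // 2) * (W // 2), D)
        # SwiGLU-style projection into the language model's embedding space
        return self.down_proj(F.silu(self.gate_proj(pooled)) * self.up_proj(pooled))


# Toy shapes: one tile as a 26x26 grid of features from two ViT layers
connector = AttentionPoolingConnector()
a, b = torch.randn(1, 26, 26, 1152), torch.randn(1, 26, 26, 1152)
print(connector(a, b).shape)  # torch.Size([1, 169, 2048]) -> 4x fewer tokens
```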
The efficiency gains are summarized in the table below:
| Metric | No Pooling | With Pooling | Reduction |
|---|---|---|---|
| Visual tokens (12 tiles + thumbnail) | 9,477 | 2,366 | 4.0× |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9× |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0× |
Since the ViT processes each tile identically regardless of pooling, these savings apply exclusively to the language model, which is the dominant cost during inference.
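The token counts in the table follow directly from the per-tile figures quoted above; as a quick sanity check:

```python
# Sanity check of the visual-token counts in the table above.
tiles = 12 + 1                              # 12 image tiles plus the global thumbnail
per_tile_raw, per_tile_pooled = 729, 182    # SigLIP2 tokens vs. after 2x2 pooling

raw, pooled = tiles * per_tile_raw, tiles * per_tile_pooled
print(raw, pooled, f"{raw / pooled:.1f}x")  # 9477 2366 4.0x
```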
## Training Procedure
A common failure mode in VLM training is catastrophic forgetting: the language model loses its text-only capabilities as it adapts to visual inputs. This is particularly acute for multilingual models, where vision adaptation can degrade performance on non-English languages.
We address this through a two-stage training pipeline with explicit multilingual data and text-only preservation.
**Stage 1: Alignment Training**
The first stage focuses on cross-language semantic grounding using caption datasets spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. Crucially, we include 15% text-only data to maintain the backbone's language understanding. The connector uses a higher learning rate (2e-4) and shorter warmup than the encoder and decoder, allowing it to adapt quickly while the pretrained components change gradually.
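As a rough illustration of this setup, the snippet below assigns a higher learning rate and shorter warmup to the connector than to the encoder and decoder. The module stand-ins, the encoder/decoder rate, and the warmup lengths are placeholders; only the connector's 2e-4 rate comes from the text.

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these would be the connector, the SigLIP2
# encoder, and the Qwen3 decoder (names here are hypothetical).
connector = nn.Linear(1152, 2048)
vision_encoder = nn.Linear(1152, 1152)
language_model = nn.Linear(2048, 2048)

# The 2e-4 connector rate is from the text; the other rates are placeholders.
param_groups = [
    {"params": connector.parameters(), "lr": 2e-4},       # adapts quickly
    {"params": vision_encoder.parameters(), "lr": 1e-5},  # placeholder
    {"params": language_model.parameters(), "lr": 1e-5},  # placeholder
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

# Linear warmup per parameter group, shorter for the connector (illustrative steps).
warmup = [500, 2000, 2000]
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=[lambda step, w=w: min(1.0, (step + 1) / w) for w in warmup],
)
```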
**Stage 2: Instruction Fine-tuning**
The second stage trains instruction-following for VQA and reasoning tasks. We combine public datasets covering academic VQA, document understanding, OCR, mathematics, and reasoning, with text-only instruction data to maintain language capabilities.
The combined data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Turkish, Vietnamese, Thai, Indonesian, Hindi, Bengali, and others.
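One simple way to realize this kind of mixture is to draw each training sample from a weighted set of sources. The source names and weights below are illustrative placeholders, not the actual data recipe.

```python
import random

# Illustrative mixture sampler. Source names and weights are placeholders;
# the exact per-source ratios are not disclosed here.
sources = {
    "multimodal_vqa": 0.45,
    "document_ocr": 0.25,
    "math_reasoning": 0.15,
    "text_only_instructions": 0.15,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training sample according to the weights."""
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(8)])
```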
## Getting Started
### Via CLI
The Hugging Face repository includes an `infer.py` script for quick experiments:

```bash
# Single image
python infer.py -i image.jpg -p "What's in this image?"

# Streaming output
python infer.py -i image.jpg -p "Describe this image" --stream

# Multiple images
python infer.py -i img1.jpg -i img2.jpg -p "Compare these images"

# Text-only
python infer.py -p "What is the capital of France?"
```

### Via Transformers
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load model and processor
model = AutoModelForCausalLM.from_pretrained(
    "jinaai/jina-vlm",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "jinaai/jina-vlm",
    trust_remote_code=True,
)

# Load an image
image = Image.open("document.png")

# Create the conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the main topic of this document?"},
        ],
    }
]

# Process and generate
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Conclusion
jina-vlm demonstrates that small VLMs can achieve strong cross-lingual visual understanding through careful architectural and training choices. The attention-pooling connector provides 4× token reduction with minimal performance impact, and incorporating text-only data during multimodal training preserves language capabilities that would otherwise degrade.
We note several limitations of the current approach:
- Tiling overhead: Processing scales linearly with tile count, which can become significant for very high-resolution images. In addition, tiling can impair tasks requiring holistic scene understanding, such as object counting or spatial reasoning across tile boundaries. The global thumbnail partially mitigates this, but native-resolution approaches may be better suited for such tasks.
- Multi-image reasoning: Performance on multi-image benchmarks is weaker due to limited training data in this regime. Optimizing for concise visual responses appears to conflict with extended multi-step reasoning, as evidenced by MMLU-Pro degradation.
Future work could explore more efficient resolution handling and targeted improvements for counting and spatial tasks, and investigate whether our multilingual training recipe transfers to larger model scales.