

We're releasing jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. By combining a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector, jina-vlm delivers strong performance across 29 languages while remaining efficient enough to run on consumer hardware.
| Model | Size | VQA Avg | MMMB | Multilingual MMBench | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| jina-vlm | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |





## Architecture
Two challenges have limited practical VLM deployment: multilingual capabilities often degrade during vision adaptation, and high-quality VLMs remain computationally expensive. jina-vlm addresses both through careful architectural choices—our attention-pooling connector reduces visual tokens by 4× with minimal performance impact—and a training recipe that explicitly preserves multilingual capabilities.
The key architectural innovation is our vision-language connector. Rather than passing all 729 visual tokens per tile to the language model, we apply 2×2 attention pooling that reduces this to 182 tokens—a 4× reduction with minimal information loss. The connector works as follows:
- Multi-layer feature fusion: We concatenate features from ViT layers 18 and 24 (ninth-to-last and third-to-last, respectively), capturing both fine-grained spatial details and high-level semantics.
- Attention pooling: For each 2×2 patch neighborhood, we compute a query as the mean of neighborhood features, then apply cross-attention to produce a single pooled representation.
- SwiGLU projection: The pooled features are projected to the language model dimension via a gated linear unit (see the sketch after this list).
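For concreteness, here is a minimal PyTorch sketch of such a connector under assumed dimensions (1152-dim ViT features, a 2048-dim language model, an even patch grid, 8 attention heads). The class, parameter names, and shapes are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPoolingConnector(nn.Module):
    """Sketch: fuse two ViT layers, pool each 2x2 neighborhood via
    cross-attention (query = neighborhood mean), project with a
    SwiGLU-style gated unit. All dimensions are illustrative."""

    def __init__(self, vit_dim: int = 1152, llm_dim: int = 2048, num_heads: int = 8):
        super().__init__()
        fused_dim = 2 * vit_dim  # concatenated features from two ViT layers
        self.pool_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.gate_proj = nn.Linear(fused_dim, llm_dim, bias=False)
        self.up_proj = nn.Linear(fused_dim, llm_dim, bias=False)
        self.down_proj = nn.Linear(llm_dim, llm_dim, bias=False)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a, feats_b: (B, H, W, vit_dim) features from two ViT layers;
        # H and W are assumed even here for simplicity.
        x = torch.cat([feats_a, feats_b], dim=-1)  # (B, H, W, 2*vit_dim)
        B, H, W, D = x.shape
        # Group patches into 2x2 neighborhoods: (B * H/2 * W/2, 4, D)
        x = x.view(B, H // 2, 2, W // 2, 2, D).permute(0, 1, 3, 2, 4, 5).reshape(-1, 4, D)
        query = x.mean(dim=1, keepdim=True)      # neighborhood mean as query, (N, 1, D)
        pooled, _ = self.pool_attn(query, x, x)  # cross-attention over the 4 patches
        pooled = pooled.reshape(B, (H // 2) * (W // 2), D)
        # SwiGLU-style projection into the language model's embedding space
        return self.down_proj(F.silu(self.gate_proj(pooled)) * self.up_proj(pooled))


# Toy shapes: one tile as a 26x26 grid of features from two ViT layers
connector = AttentionPoolingConnector()
a, b = torch.randn(1, 26, 26, 1152), torch.randn(1, 26, 26, 1152)
print(connector(a, b).shape)  # torch.Size([1, 169, 2048]) -> 4x fewer tokens
```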
The efficiency gains are summarized in the table below:
| Metric | No Pooling | With Pooling | Reduction |
|---|---|---|---|
| Visual tokens (12 tiles + thumbnail) | 9,477 | 2,366 | 4.0× |
| LLM prefill FLOPs | 27.2 TFLOPs | 6.9 TFLOPs | 3.9× |
| KV-cache memory | 2.12 GB | 0.53 GB | 4.0× |
Since the ViT processes each tile identically regardless of pooling, these savings apply exclusively to the language model, which is the dominant cost during inference.
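The token counts in the table follow directly from the per-tile figures quoted above; as a quick sanity check:

```python
# Sanity check of the visual-token counts in the table above.
tiles = 12 + 1                              # 12 image tiles plus the global thumbnail
per_tile_raw, per_tile_pooled = 729, 182    # SigLIP2 tokens vs. after 2x2 pooling

raw, pooled = tiles * per_tile_raw, tiles * per_tile_pooled
print(raw, pooled, f"{raw / pooled:.1f}x")  # 9477 2366 4.0x
```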
## Training Procedure
A common failure mode in VLM training is catastrophic forgetting: the language model loses its text-only capabilities as it adapts to visual inputs. This is particularly acute for multilingual models, where vision adaptation can degrade performance on non-English languages.
We address this through a two-stage training pipeline with explicit multilingual data and text-only preservation.
**Stage 1: Alignment Training**
The first stage focuses on cross-language semantic grounding using caption datasets spanning diverse visual domains: natural scenes, documents, infographics, and diagrams. Crucially, we include 15% text-only data to maintain the backbone's language understanding. The connector uses a higher learning rate (2e-4) and shorter warmup than the encoder and decoder, allowing it to adapt quickly while the pretrained components change gradually.
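As a rough illustration of this setup, the snippet below assigns a higher learning rate and shorter warmup to the connector than to the encoder and decoder. The module stand-ins, the encoder/decoder rate, and the warmup lengths are placeholders; only the connector's 2e-4 rate comes from the text.

```python
import torch
import torch.nn as nn

# Stand-in modules; in practice these would be the connector, the SigLIP2
# encoder, and the Qwen3 decoder (names here are hypothetical).
connector = nn.Linear(1152, 2048)
vision_encoder = nn.Linear(1152, 1152)
language_model = nn.Linear(2048, 2048)

# The 2e-4 connector rate is from the text; the other rates are placeholders.
param_groups = [
    {"params": connector.parameters(), "lr": 2e-4},       # adapts quickly
    {"params": vision_encoder.parameters(), "lr": 1e-5},  # placeholder
    {"params": language_model.parameters(), "lr": 1e-5},  # placeholder
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=0.01)

# Linear warmup per parameter group, shorter for the connector (illustrative steps).
warmup = [500, 2000, 2000]
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=[lambda step, w=w: min(1.0, (step + 1) / w) for w in warmup],
)
```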
**Stage 2: Instruction Fine-tuning**
The second stage trains instruction-following for VQA and reasoning tasks. We combine public datasets covering academic VQA, document understanding, OCR, mathematics, and reasoning, with text-only instruction data to maintain language capabilities.
The combined data comprises approximately 5M multimodal samples and 12B text tokens across 29 languages, with roughly half in English and the remainder spanning Chinese, Arabic, German, Spanish, French, Italian, Japanese, Korean, Portuguese, Russian, Turkish, Vietnamese, Thai, Indonesian, Hindi, Bengali, and others.
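One simple way to realize this kind of mixture is to draw each training sample from a weighted set of sources. The source names and weights below are illustrative placeholders, not the actual data recipe.

```python
import random

# Illustrative mixture sampler. Source names and weights are placeholders;
# the exact per-source ratios are not disclosed here.
sources = {
    "multimodal_vqa": 0.45,
    "document_ocr": 0.25,
    "math_reasoning": 0.15,
    "text_only_instructions": 0.15,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training sample according to the weights."""
    names, weights = zip(*sources.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(8)])
```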
## Getting Started
### Via CLI
The Hugging Face repository includes an `infer.py` script for quick experiments:

```bash
# Single image
python infer.py -i image.jpg -p "What's in this image?"

# Streaming output
python infer.py -i image.jpg -p "Describe this image" --stream

# Multiple images
python infer.py -i img1.jpg -i img2.jpg -p "Compare these images"

# Text-only
python infer.py -p "What is the capital of France?"
```

### Via Transformers
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load model and processor
model = AutoModelForCausalLM.from_pretrained(
    "jinaai/jina-vlm",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "jinaai/jina-vlm",
    trust_remote_code=True,
)

# Load an image
image = Image.open("document.png")

# Create the conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is the main topic of this document?"},
        ],
    }
]

# Process and generate
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Conclusion
jina-vlm demonstrates that small VLMs can achieve strong cross-lingual visual understanding through careful architectural and training choices. The attention-pooling connector provides 4× token reduction with minimal performance impact, and incorporating text-only data during multimodal training preserves language capabilities that would otherwise degrade.
We note several limitations of the current approach:
- Tiling overhead: Processing scales linearly with tile count, which can become significant for very high-resolution images. In addition, tiling can impair tasks requiring holistic scene understanding, such as object counting or spatial reasoning across tile boundaries. The global thumbnail partially mitigates this, but native-resolution approaches may be better suited for such tasks.
- Multi-image reasoning: Performance on multi-image benchmarks is weaker due to limited training data in this regime. Optimizing for concise visual responses appears to conflict with extended multi-step reasoning, as evidenced by MMLU-Pro degradation.
Future work could explore more efficient resolution handling and targeted improvements for counting and spatial tasks, and investigate whether our multilingual training recipe transfers to larger model scales.