ReaderLM-v2

Frontier small language model for converting raw HTML into markdown or JSON
Release Post
License: CC-BY-NC-4.0
Release Date: 2025-01-16
Input: Text (HTML)
Output: Text (Markdown), Text (JSON)
Model Details
Parameters: 1.54B
Input Token Length: 512K
Language Support
🌍 Multilingual support
Related Models
reader-lm-1.5b
Tags
reader
language-model
multilingual
document-processing
long-context
text-understanding
content-extraction
cross-lingual
Available via
Jina API, Commercial License, AWS SageMaker, Microsoft Azure, Google Cloud, Hugging Face
Publications (1)
ICLR 2025
March 04, 2025
ReaderLM-v2: Small Language Model for HTML to Markdown and JSON

Overview

ReaderLM-v2 is a 1.5B-parameter language model that converts raw HTML into Markdown or JSON, handling up to 512K tokens of combined input/output length with support for 29 languages. Unlike its predecessor, which treated HTML-to-Markdown conversion as a 'selective-copy' task, v2 approaches it as a translation process, enabling superior handling of complex elements such as code fences, nested lists, tables, and LaTeX equations. The model maintains consistent performance across varying context lengths and introduces direct HTML-to-JSON generation with predefined schemas.
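
For the HTML-to-JSON path, the schema is supplied to the model together with the raw HTML. Below is a minimal sketch of what such a schema-guided prompt could look like; the field names and instruction wording are invented for illustration and are not taken from the official notebook.

```python
# Illustrative only: a predefined JSON schema paired with raw HTML for
# schema-guided extraction. Schema fields and instruction wording are
# placeholders, not the official ReaderLM-v2 prompt format.
import json

schema = {
    "type": "object",
    "properties": {
        "title":  {"type": "string"},
        "author": {"type": "string"},
        "date":   {"type": "string"},
        "body":   {"type": "string"},
    },
    "required": ["title", "body"],
}

html = "<article><h1>ReaderLM-v2 released</h1><p>By Jina AI, 2025-01-16.</p></article>"

prompt_body = (
    "Extract the specified information from the HTML and return valid JSON "
    "matching this schema:\n"
    + json.dumps(schema, indent=2)
    + f"\n```html\n{html}\n```"
)
print(prompt_body)
```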

Methods

Built on Qwen2.5-1.5B-Instruct, ReaderLM-v2 was trained on html-markdown-1m, a dataset of one million HTML documents averaging 56,000 tokens each. The training process included: 1) long-context pretraining using ring-zag attention and RoPE to expand the context window from 32K to 256K tokens, 2) supervised fine-tuning with refined datasets, 3) direct preference optimization for output alignment, and 4) self-play reinforcement tuning. Data preparation followed a three-step Draft-Refine-Critique pipeline powered by Qwen2.5-32B-Instruct, with specialized models trained for specific tasks before being merged via linear parameter interpolation.
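
The final step above merges several task-specialized checkpoints into one model. As a rough illustration of what linear parameter interpolation means here, a sketch follows; the checkpoint names and weights are placeholders, not the ones used for ReaderLM-v2.

```python
# Hedged sketch: merging task-specific checkpoints by linear parameter
# interpolation. Checkpoint paths and weights are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM

checkpoints = {
    "markdown-specialist": 0.5,  # hypothetical HTML-to-Markdown checkpoint
    "json-specialist": 0.5,      # hypothetical HTML-to-JSON checkpoint
}

merged_state = None
for path, weight in checkpoints.items():
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    state = model.state_dict()
    if merged_state is None:
        merged_state = {k: weight * v.clone() for k, v in state.items()}
    else:
        for k, v in state.items():
            merged_state[k] += weight * v

# Load the interpolated weights back into one base architecture and save.
base = AutoModelForCausalLM.from_pretrained("markdown-specialist", torch_dtype=torch.float32)
base.load_state_dict(merged_state)
base.save_pretrained("merged-readerlm-sketch")
```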

Performance

In comprehensive benchmarks, ReaderLM-v2 outperforms much larger models like Qwen2.5-32B-Instruct and Gemini2-flash-expr on HTML-to-Markdown tasks. For main-content extraction, it achieves a ROUGE-L of 0.84, a Jaro-Winkler similarity of 0.82, and a significantly lower Levenshtein distance (0.22) than its competitors. On HTML-to-JSON tasks, it remains competitive with an F1 score of 0.81 and a 98% pass rate. The model processes input at 67 tokens/s and generates output at 36 tokens/s on a T4 GPU, with degeneration issues significantly reduced through contrastive loss training.
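
The figures above come from Jina's own benchmark. As a rough sketch, the three extraction metrics can be computed for a single prediction/reference pair as below, using the rouge-score and python-Levenshtein packages; this is an illustration, not the official evaluation harness.

```python
# Sketch of computing the extraction metrics mentioned above for one
# (prediction, reference) pair. Requires: pip install rouge-score python-Levenshtein
from rouge_score import rouge_scorer
import Levenshtein

def extraction_metrics(prediction: str, reference: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure
    jaro_winkler = Levenshtein.jaro_winkler(prediction, reference)
    # Normalize edit distance by the longer string, so 0.0 means identical.
    norm_lev = Levenshtein.distance(prediction, reference) / max(
        len(prediction), len(reference), 1
    )
    return {"rouge_l": rouge_l, "jaro_winkler": jaro_winkler, "levenshtein": norm_lev}

print(extraction_metrics("# Title\n\nBody text.", "# Title\n\nBody text!"))
```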

Best Practice

The model is accessible through a Google Colab notebook demonstrating HTML-to-Markdown conversion, JSON extraction, and instruction following. For HTML-to-Markdown tasks, users can input raw HTML without any prefix instruction, while JSON extraction requires a specific schema format. The create_prompt helper function makes it easy to build prompts for both tasks. The model runs on Colab's free T4 GPU tier (with vllm and triton installed), though the T4's lack of bfloat16 and flash attention 2 support imposes limitations; an RTX 3090/4090 is recommended for production use. The model will be available on the AWS SageMaker, Azure, and GCP marketplaces, licensed under CC BY-NC 4.0 for non-commercial use.
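
A minimal sketch of running HTML-to-Markdown conversion locally with Hugging Face transformers follows. The create_prompt helper here is a simplified stand-in for the notebook's helper, and the instruction wording is an assumption rather than the official prompt.

```python
# Minimal sketch of HTML-to-Markdown conversion with ReaderLM-v2 via
# Hugging Face transformers. The create_prompt helper is a simplified
# stand-in, not the official notebook implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def create_prompt(html: str, instruction: str = "Extract the main content from the given HTML and convert it to Markdown format.") -> str:
    # Chat-style prompt; the raw HTML is passed inside the user turn.
    messages = [{"role": "user", "content": f"{instruction}\n```html\n{html}\n```"}]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

html = "<html><body><h1>Hello</h1><p>World <b>bold</b></p></body></html>"
inputs = tokenizer(create_prompt(html), return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```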
Blogs that mention this model
January 15, 2025 • 17 minutes read
ReaderLM v2: Frontier Small Language Model for HTML to Markdown and JSON
ReaderLM-v2 is a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional quality.
Jina AI
April 08, 2025 • 21 minutes read
jina-reranker-m0: Multilingual Multimodal Document Reranker
Introducing jina-reranker-m0, our new multilingual multimodal reranker for retrieving visual documents, with SOTA performance on multilingual long documents and code searching tasks.
Jina AI
January 31, 2025 • 14 minutes read
A Practical Guide to Deploying Search Foundation Models in Production
We offer detailed cost and performance breakdowns for three deployment strategies: Jina API, self-hosted K8s, and AWS SageMaker, to help you make the right decision.
Saahil Ognawala
Scott Martens