Overview
ReaderLM-v2 is a 1.5B-parameter language model that converts raw HTML into markdown or JSON, handling up to 512K tokens of combined input/output length and supporting 29 languages. Unlike its predecessor, which treated HTML-to-markdown as a 'selective-copy' task, v2 approaches it as a translation process, which enables superior handling of complex elements such as code fences, nested lists, tables, and LaTeX equations. The model maintains consistent performance across varying context lengths and introduces direct HTML-to-JSON generation against predefined schemas.
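To make the task concrete, here is an illustrative input/output pair (hand-written, not actual model output) showing the kind of nested structure the model must preserve when translating HTML to markdown:

```python
# Illustrative example only: the kind of conversion ReaderLM-v2 performs.
html_input = """
<article>
  <h2>Results</h2>
  <ul>
    <li>Accuracy: <code>0.84</code>
      <ul><li>measured on held-out pages</li></ul>
    </li>
  </ul>
</article>
"""

expected_markdown = """
## Results

- Accuracy: `0.84`
  - measured on held-out pages
"""
```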
Methods
Built on Qwen2.5-1.5B-Instruct, ReaderLM-v2 was trained on html-markdown-1m, a dataset of one million HTML documents averaging 56,000 tokens each. The training process comprised: 1) long-context pretraining using ring-zag attention and RoPE scaling to expand the context window from 32K to 256K tokens, 2) supervised fine-tuning on refined datasets, 3) direct preference optimization for output alignment, and 4) self-play reinforcement tuning. Data preparation followed a three-step Draft-Refine-Critique pipeline powered by Qwen2.5-32B-Instruct, and specialized models were trained for specific tasks before being merged via linear parameter interpolation.
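The final merging step, linear parameter interpolation, amounts to a weighted average of the specialized models' weights. A minimal PyTorch sketch, assuming two task-specialized checkpoints with identical architectures (the checkpoint names and the 0.5/0.5 weights are illustrative, not the published recipe):

```python
import torch

def linear_merge(state_dicts, weights):
    """Linearly interpolate parameters across models with identical
    architectures: merged[k] = sum_i weights[i] * state_dicts[i][k]."""
    assert abs(sum(weights) - 1.0) < 1e-6, "interpolation weights should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical file names; the task-specialized models were not released separately.
sd_markdown = torch.load("readerlm_markdown_specialist.pt")
sd_json = torch.load("readerlm_json_specialist.pt")
merged = linear_merge([sd_markdown, sd_json], weights=[0.5, 0.5])
torch.save(merged, "readerlm_merged.pt")
```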
Performance
In comprehensive benchmarks, ReaderLM-v2 outperforms much larger models such as Qwen2.5-32B-Instruct and Gemini2-flash-expr on HTML-to-Markdown tasks. For main-content extraction it achieves a ROUGE-L of 0.84, a Jaro-Winkler similarity of 0.82, and a significantly lower Levenshtein distance (0.22) than competitors. On HTML-to-JSON tasks it remains competitive, with an F1 score of 0.81 and a 98% pass rate. The model processes 67 tokens/s of input and generates 36 tokens/s of output on a T4 GPU, and contrastive loss during training significantly reduces degeneration issues such as repetition.
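For reference, a Levenshtein score on a 0-1 scale like the 0.22 reported here suggests a length-normalized edit distance (lower is better); the exact normalization used in the benchmark is an assumption. A minimal pure-Python sketch of such a score between model output and a reference string:

```python
def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance between a and b divided by the longer length,
    so 0.0 means identical strings and 1.0 means maximally different."""
    if not a and not b:
        return 0.0
    prev = list(range(len(b) + 1))  # DP row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1] / max(len(a), len(b))

print(normalized_levenshtein("## Title", "# Title"))  # 0.125: near-identical markdown
```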
Best Practice
The model is accessible through a Google Colab notebook that demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction following. For HTML-to-Markdown tasks, users can feed in raw HTML without any prefix instructions, while JSON extraction requires a specific schema format; the notebook's create_prompt helper builds the prompts for both tasks, as sketched below. The model runs on Colab's free T4 GPU tier (with vllm and triton installed), but performance is constrained there because the T4 lacks bfloat16 and flash attention 2 support; an RTX 3090 or 4090 is recommended for production use. The model will also be available on the AWS SageMaker, Azure, and GCP marketplaces, licensed under CC BY-NC 4.0 for non-commercial use.
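A minimal sketch of such usage via Hugging Face transformers. The create_prompt body below is a hypothetical stand-in for the notebook's helper, and the default instruction string is an assumption; consult the notebook for the exact version:

```python
# Sketch (not the notebook's exact code) of HTML-to-Markdown conversion
# with ReaderLM-v2 via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jinaai/ReaderLM-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda")

def create_prompt(html, instruction=None):
    # Hypothetical stand-in for the notebook's helper: wraps raw HTML
    # (plus an optional instruction, e.g. a JSON schema) in the chat template.
    instruction = instruction or (
        "Extract the main content from the given HTML and convert it to Markdown format."
    )
    messages = [{"role": "user", "content": f"{instruction}\n```html\n{html}\n```"}]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"
inputs = tokenizer(create_prompt(html), return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```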