Overview
Reader LM 1.5B is a compact language model built for efficient document processing: converting complex, noisy web content into clean, structured formats. It addresses a recurring problem in modern AI pipelines, the need to process and clean HTML for downstream tasks, without relying on brittle rule-based systems or resource-intensive large language models. Despite its compact 1.54B-parameter footprint, the model outperforms models 50 times its size on HTML-to-markdown conversion. Organizations dealing with large-scale web content processing, documentation automation, or content management systems will find it particularly valuable for its combination of long-document handling and high conversion accuracy.
Methods
The model employs a "shallow-but-wide" architecture that departs from conventional scaling recipes in language model design. Its 28 transformer layers use grouped-query attention with 12 query heads and 2 key-value heads, a balance that favors the selective-copy behavior HTML cleanup requires while preserving enough depth for semantic understanding. The hidden size is 1536 and the intermediate (feed-forward) size is 8960, and the model supports sequences of up to 256K tokens. Training proceeded in two stages: first on short-and-simple HTML with sequences up to 32K tokens, then on long-and-hard HTML with sequences up to 128K tokens, using zigzag-ring-attention for efficient long-context training. Combined with contrastive search and a dedicated repetition-detection mechanism at inference time, this setup lets the model avoid the degeneration and dull loops that typically plague smaller language models on complex document-processing tasks.
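For illustration, the layer, head, and width settings above map onto a Qwen2-style decoder configuration in the Hugging Face transformers library. The sketch below is not the vendor's published config: only the values stated in the text are taken from the description, and everything else (vocabulary size, RoPE settings, the exact context-length constant) falls back to library defaults or is an assumption.

```python
# Sketch of the "shallow-but-wide" layout described above, expressed as a
# Qwen2-style decoder config. Values not stated in the text are assumptions
# or library defaults, not the released checkpoint's exact configuration.
from transformers import Qwen2Config

config = Qwen2Config(
    num_hidden_layers=28,            # relatively shallow stack
    hidden_size=1536,                # wide hidden dimension
    intermediate_size=8960,          # feed-forward width
    num_attention_heads=12,          # query heads
    num_key_value_heads=2,           # grouped-query attention with 2 KV heads
    max_position_embeddings=262144,  # ~256K-token context (assumed exact value)
)
print(config)
```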
Performance
In benchmark evaluations, Reader LM 1.5B achieves a ROUGE-L score of 0.72 and a Token Error Rate (TER, lower is better) of 0.19 on HTML-to-markdown conversion, significantly outperforming much larger models such as GPT-4o (0.43 ROUGE-L, 0.50 TER) and Gemini-1.5-Pro (0.42 ROUGE-L, 0.48 TER). It also performs strongly in qualitative evaluations across four key dimensions: header extraction, main content extraction, rich structure preservation, and markdown syntax usage. The model maintains high accuracy across diverse document types, from news articles and blog posts to landing pages and forum posts, and in multiple languages including English, German, Japanese, and Chinese. This performance is achieved while processing documents up to 256K tokens in length, eliminating the expensive chunking operations typically required with larger models.
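For context, ROUGE-L measures the longest common subsequence (LCS) shared between generated and reference markdown. The sketch below shows one minimal way to compute an LCS-based F-score; whitespace tokenization and the balanced F1 form are simplifying assumptions, and the original evaluation harness may tokenize and weight differently.

```python
# Minimal sketch of an LCS-based ROUGE-L F-score between generated and reference
# markdown. Whitespace tokenization and F1 weighting are simplifications; the
# original evaluation setup may differ.
def rouge_l(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, start=1):
        for j, r in enumerate(ref, start=1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(cand)][len(ref)]
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Example: a generated snippet scored against its reference markdown.
print(rouge_l("# Title\nSome text here", "# Title\nSome body text here"))
```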
Best Practice
To deploy Reader LM 1.5B effectively, organizations should focus on scenarios involving complex HTML document processing where accuracy and efficiency are paramount. The model performs best on CUDA-capable GPU infrastructure, though its efficient architecture means it can run on more modest hardware than larger alternatives require. For production deployments, the model is available through both AWS SageMaker and Azure Marketplace, offering flexible integration options. Note that it is optimized specifically for HTML-to-markdown conversion and is not intended for general-purpose text generation or other NLP tasks. When processing extremely long documents that approach the 256K-token context limit, expect some degradation in quality, since the longest training sequences were 128K tokens. For optimal results, enable the provided repetition detection mechanisms and consider using contrastive search during inference to maintain output quality; a minimal inference sketch follows.
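The sketch below shows one way such an inference setup might look, assuming the checkpoint is published on Hugging Face as jinaai/reader-lm-1.5b and loaded through the transformers library on a CUDA-capable machine. The generation settings are illustrative rather than vendor-recommended defaults: penalty_alpha and top_k switch transformers' generate() into contrastive search, and repetition_penalty is a simple guard against degenerate loops.

```python
# Minimal inference sketch. The repository name, device placement, and generation
# settings are assumptions for illustration, not the vendor's recommended setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

html = "<html><body><h1>Hello</h1><p>This is <b>raw</b> HTML.</p></body></html>"
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": html}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    penalty_alpha=0.6,        # contrastive search
    top_k=4,
    repetition_penalty=1.08,  # discourage dull loops
)
markdown = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(markdown)
```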
Blogs that mention this model