reader-lm-1.5b

A small language model for converting raw HTML into Markdown.

Note: this model has been deprecated in favor of newer models.
License: CC-BY-NC-4.0
Release Date: 2024-08-11
Input: Text (HTML)
Output: Text (Markdown)
Model Details

Parameters: 1.54B
Input Token Length: 256K
Language Support: 🌍 Multilingual
Related Models: reader-lm-0.5b
Tags: reader, language-model, multilingual, document-processing, long-context, text-understanding, content-extraction, cross-lingual
Available via: Commercial License, AWS SageMaker, Microsoft Azure, Hugging Face

Overview

Reader LM 1.5B represents a breakthrough in efficient document processing, addressing the critical challenge of converting complex web content into clean, structured formats. This specialized language model tackles a fundamental problem in modern AI pipelines: the need to efficiently process and clean HTML content for downstream tasks without relying on brittle rule-based systems or resource-intensive large language models. What makes this model truly remarkable is its ability to outperform models 50 times its size while maintaining a surprisingly compact 1.54B parameter footprint. Organizations dealing with large-scale web content processing, documentation automation, or content management systems will find this model particularly valuable for its ability to handle extremely long documents while delivering superior accuracy in HTML-to-markdown conversion.

Methods

The model employs an innovative "shallow-but-wide" architecture that challenges traditional scaling approaches in language model design. At its core are 28 transformer layers configured with 12 query heads and 2 key-value heads, creating a unique balance that optimizes for selective-copy operations while maintaining deep semantic understanding. The architecture features a hidden size of 1536 and an intermediate size of 8960, carefully tuned to handle sequences up to 256K tokens. The training process involved two distinct stages: first focusing on short-and-simple HTML with 32K token sequences, then advancing to long-and-hard HTML with 128K tokens, implementing zigzag-ring-attention for efficient processing. This approach, combined with contrastive search and specialized repetition detection mechanisms, enables the model to avoid common issues like degeneration and dull loops that typically plague smaller language models handling complex document processing tasks.
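The hyperparameters named above can be collected into a small configuration sketch. The values come directly from the text; the field names follow common Hugging Face conventions, and the repo id "jinaai/reader-lm-1.5b" is an assumption about where the weights are hosted, not something stated on this page.

```python
# Illustrative config for the "shallow-but-wide" architecture described above.
# Values are from the model card; field names and repo id are assumptions.
reader_lm_1_5b_config = {
    "num_hidden_layers": 28,            # transformer layers
    "num_attention_heads": 12,          # query heads
    "num_key_value_heads": 2,           # grouped key-value heads (selective copy)
    "hidden_size": 1536,
    "intermediate_size": 8960,
    "max_position_embeddings": 256 * 1024,  # ~256K-token context window
}

# If the weights are on Hugging Face, the actual config can be inspected:
# from transformers import AutoConfig
# cfg = AutoConfig.from_pretrained("jinaai/reader-lm-1.5b")
# print(cfg.num_hidden_layers, cfg.hidden_size, cfg.max_position_embeddings)
```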

Performance

In comprehensive benchmark evaluations, Reader LM 1.5B demonstrates exceptional capabilities that challenge industry standards. The model achieves a ROUGE-L score of 0.72 and a Token Error Rate of 0.19, significantly outperforming larger models like GPT-4 (0.43 ROUGE-L, 0.50 TER) and Gemini-1.5-Pro (0.42 ROUGE-L, 0.48 TER) in HTML-to-markdown conversion tasks. Its performance particularly shines in qualitative evaluations across four key dimensions: header extraction, main content extraction, rich structure preservation, and markdown syntax usage. The model consistently maintains high accuracy across diverse document types, from news articles and blog posts to landing pages and forum posts, in multiple languages including English, German, Japanese, and Chinese. This performance is achieved while processing documents up to 256K tokens in length, eliminating the need for expensive chunking operations that are typically required with larger models.
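The two headline metrics can be approximated with off-the-shelf tooling. A minimal sketch, assuming ROUGE-L is computed with the rouge-score package and Token Error Rate is taken as token-level edit distance normalized by reference length; the benchmark's exact tokenization and TER definition are not specified here, so treat this as an approximation rather than the published evaluation protocol.

```python
# Score a generated Markdown document against a reference.
# Requires: pip install rouge-score
from rouge_score import rouge_scorer

def rouge_l(reference: str, hypothesis: str) -> float:
    """Longest-common-subsequence F-measure (ROUGE-L)."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    return scorer.score(reference, hypothesis)["rougeL"].fmeasure

def token_error_rate(reference: str, hypothesis: str) -> float:
    """Assumed TER: Levenshtein distance over whitespace tokens / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

# Identical documents score 1.0 ROUGE-L and 0.0 TER.
print(rouge_l("# Title\n\nSome text.", "# Title\n\nSome text."))
print(token_error_rate("# Title Some text.", "# Title Some text."))
```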

Best Practice

To effectively deploy Reader LM 1.5B, organizations should focus on scenarios involving complex HTML document processing where accuracy and efficiency are paramount. The model requires CUDA-capable GPU infrastructure for optimal performance, though its efficient architecture means it can run effectively on more modest hardware compared to larger alternatives. For production deployments, the model is available through both AWS SageMaker and Azure Marketplace, offering flexible integration options. While the model excels at HTML-to-markdown conversion, it's important to note that it's specifically optimized for this task and may not be suitable for general-purpose text generation or other NLP tasks. When processing extremely long documents (approaching 512K tokens), users should be aware that performance might degrade as this exceeds the model's training parameters. For optimal results, implement the provided repetition detection mechanisms and consider using contrastive search during inference to maintain output quality.
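A minimal inference sketch with the Hugging Face transformers library follows. It assumes the weights are published as "jinaai/reader-lm-1.5b" and that the raw HTML is passed as a single user turn via the chat template; both the repo id and the prompt format are assumptions for illustration. The penalty_alpha/top_k settings switch generate() into contrastive search, matching the recommendation above.

```python
# Sketch: convert an HTML page to Markdown with contrastive search decoding.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "jinaai/reader-lm-1.5b"   # assumed Hugging Face repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)

html = "<html><body><h1>Hello</h1><p>World</p></body></html>"

# Assumed prompt format: the raw HTML as one user message, no extra instruction.
messages = [{"role": "user", "content": html}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(device)

# penalty_alpha + top_k enables contrastive search, which helps suppress the
# degeneration and repetition loops mentioned in the Methods section.
output_ids = model.generate(
    input_ids,
    max_new_tokens=1024,
    penalty_alpha=0.6,
    top_k=4,
    repetition_penalty=1.08,
)
markdown = tokenizer.decode(
    output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
)
print(markdown)
```

For long pages, raise max_new_tokens accordingly; the 256K context window means whole documents can usually be passed without chunking.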
Blogs that mention this model

September 11, 2024 • 13 minutes read
Reader-LM: Small Language Models for Cleaning and Converting HTML to Markdown
Reader-LM-0.5B and Reader-LM-1.5B are two novel small language models inspired by Jina Reader, designed to convert raw, noisy HTML from the open web into clean markdown.

January 15, 2025 • 17 minutes read
ReaderLM v2: Frontier Small Language Model for HTML to Markdown and JSON
ReaderLM-v2 is a 1.5B small language model for HTML-to-Markdown conversion and HTML-to-JSON extraction with exceptional quality.