# web2llm
Fetch any web page. Get clean Markdown. Ready for LLMs.
web2llm is a high-performance, modular Rust crate that fetches web pages, strips away noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.
## Quick Start

Add this to your `Cargo.toml`:

```toml
[dependencies]
web2llm = "0.3.0"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
Fetch and print Markdown in one call:

```rust
use web2llm::fetch;

#[tokio::main]
async fn main() {
    // Field names below are illustrative of the public API.
    let result = fetch("https://example.com").await.unwrap();
    println!("{}", result.markdown);
}
```
## Features
- Content-aware extraction — isolates the main article body with extreme precision.
- Clean Markdown output — preserves headings, tables, code blocks, and inline links.
- Adaptive fetch — automatic fallback to headless browser for JS-heavy SPAs.
- High Performance — zero-copy traversal and bump-allocation (~3.9ms for Wikipedia).
- Semantic Chunking — divide content into logical, token-budgeted islands for AI apps.
## Configuration & Fetch Strategies

You can control how web2llm handles pages via the `FetchMode` configuration:
- `FetchMode::Static`: Fast, standard HTTP request. No JavaScript execution.
- `FetchMode::Dynamic`: Uses a headless browser to render the page. Required for SPAs.
- `FetchMode::Auto`: (Default) Smart mode. Tries a fast static fetch first, detects if the page is an SPA shell, and automatically retries with the browser only if needed.
```rust
use web2llm::{FetchMode, Web2llmConfig};

// Field names are illustrative of the public API; see the crate docs.
let config = Web2llmConfig {
    fetch_mode: FetchMode::Dynamic,
    ..Default::default()
};
```
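The SPA-shell detection behind `FetchMode::Auto` can be approximated with a simple heuristic. This standalone sketch (not the crate's actual implementation; the function name and threshold are illustrative) flags pages whose `<body>` contains almost no visible text once `<script>` sections and tags are removed:

```rust
/// Heuristic SPA-shell check: a static fetch that returns a nearly empty
/// <body>, or one dominated by <script> tags, likely needs a headless
/// browser render. Simplified sketch, not web2llm's real detection logic.
fn looks_like_spa_shell(html: &str) -> bool {
    let lower = html.to_lowercase();
    // Locate the raw body content, if any.
    let body = match (lower.find("<body"), lower.rfind("</body>")) {
        (Some(start), Some(end)) if start < end => {
            match lower[start..].find('>').map(|i| start + i + 1) {
                Some(s) if s < end => &lower[s..end],
                _ => return true, // malformed <body> tag: treat as shell
            }
        }
        _ => return true, // no body at all
    };
    // Strip <script>...</script> sections.
    let mut visible = String::new();
    let mut rest = body;
    while let Some(open) = rest.find("<script") {
        visible.push_str(&rest[..open]);
        match rest[open..].find("</script>") {
            Some(close) => rest = &rest[open + close + "</script>".len()..],
            None => { rest = ""; break; }
        }
    }
    visible.push_str(rest);
    // Drop remaining tags and whitespace; a real page has plenty of text left.
    let mut text = String::new();
    let mut in_tag = false;
    for c in visible.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag && !c.is_whitespace() => text.push(c),
            _ => {}
        }
    }
    text.len() < 200 // tunable threshold
}

fn main() {
    let shell = r#"<html><body><div id="root"></div><script src="app.js"></script></body></html>"#;
    let article = format!("<html><body><p>{}</p></body></html>", "real content ".repeat(50));
    assert!(looks_like_spa_shell(shell));
    assert!(!looks_like_spa_shell(&article));
    println!("shell detected, article passed");
}
```

A production detector would also weigh signals like known framework markers and content-length headers; the point is only that the static result itself tells you whether a browser pass is worth the cost.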
## Lightweight Build (Optional)

web2llm includes Chromium support by default for a "plug-and-play" experience. Power users who only need static scraping can disable default features to remove the Chromium dependency (~50 sub-dependencies):

```toml
[dependencies]
web2llm = { version = "0.3.0", default-features = false }
```
## Performance
web2llm is built for extreme speed and high-throughput ingestion.
Note: Metrics represent pure extraction and processing throughput, excluding network latency.
| Task | Average Time | Throughput |
|---|---|---|
| Simple Page Extraction | < 1.0 ms | ~1,000+ pages/sec |
| Wikipedia (Large) Extraction | ~4.0 ms | ~250 pages/sec |
| Batch Fetch (100x Wikipedia) | ~100 ms | ~1,000 pages/sec |
*Speed may vary across systems.*
## Advanced: Semantic Chunking
For "true AI" applications and RAG pipelines, web2llm can divide documents into logical, structurally-aware chunks that fit your token budget without splitting paragraphs mid-sentence.
```rust
use web2llm::{Web2llmClient, Web2llmConfig};

// Type, field, and method names are illustrative of the public API.
let config = Web2llmConfig {
    max_chunk_tokens: 512,
    ..Default::default()
};
let client = Web2llmClient::new(config).unwrap();
let result = client.fetch("https://example.com").await.unwrap();

// Access granular chunks for precise vector embedding
for chunk in result.chunks {
    println!("{}", chunk.markdown);
}
```
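Conceptually, token-budgeted chunking packs whole structural units into chunks rather than cutting at a fixed character offset. This standalone sketch (not web2llm's implementation; it uses whitespace-delimited words as stand-in "tokens" and blank lines as paragraph boundaries) shows the invariant that no paragraph is split:

```rust
/// Simplified token-budgeted chunker: split Markdown on blank lines and
/// greedily pack paragraphs into chunks that stay under `budget` tokens.
/// web2llm's real chunker is structure-aware (headings, lists, tables);
/// this only illustrates the "never split a paragraph" rule.
fn chunk_by_paragraphs(markdown: &str, budget: usize) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0;
    for para in markdown.split("\n\n").filter(|p| !p.trim().is_empty()) {
        let tokens = para.split_whitespace().count();
        // Flush the current chunk if this paragraph would bust the budget.
        if current_tokens > 0 && current_tokens + tokens > budget {
            chunks.push(current.trim_end().to_string());
            current.clear();
            current_tokens = 0;
        }
        current.push_str(para.trim());
        current.push_str("\n\n");
        current_tokens += tokens;
    }
    if current_tokens > 0 {
        chunks.push(current.trim_end().to_string());
    }
    chunks
}

fn main() {
    let doc = "# Title\n\nfirst paragraph with five tokens here\n\nsecond paragraph also has several tokens\n\nshort one";
    let chunks = chunk_by_paragraphs(doc, 10);
    // No chunk exceeds the budget unless a single paragraph does.
    assert!(chunks.iter().all(|c| c.split_whitespace().count() <= 10));
    println!("{} chunks produced", chunks.len());
}
```

A production chunker would measure real tokenizer counts (not words) and recurse into oversized sections, but the greedy pack-and-flush loop is the core idea.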
## Architecture
The pipeline executes in 5 stages:
```text
URL
 │
 ▼
[1] Pre-flight — URL validation, robots.txt check, rate limiting
 │
 ▼
[2] Fetch — Static fetch (reqwest) or Dynamic fallback (chromiumoxide)
 │
 ▼
[3] Score — Bottom-up recursive scoring builds a "Scored Tree" (Bump-allocated)
 │
 ▼
[4] Chunk & Wash — Top-down "Flatten or Recurse" chunking + Markdown optimization
 │
 ▼
[5] Output — PageResult struct containing Vec<PageChunk>
```
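The bottom-up scoring in stage [3] can be illustrated with a toy recursive scorer over a DOM-like tree. This is purely a sketch of the idea (the node shape, class-name penalties, and weights are invented for illustration), not web2llm's bump-allocated implementation:

```rust
/// Toy bottom-up scorer: each node's score is its own text length,
/// penalized for "noisy" class names, plus its children's scores.
struct Node {
    class: &'static str,
    text_len: usize,
    children: Vec<Node>,
}

fn score(node: &Node) -> f64 {
    let noisy = ["nav", "footer", "ad", "sidebar"]
        .iter()
        .any(|n| node.class.contains(n));
    let own = node.text_len as f64 * if noisy { 0.1 } else { 1.0 };
    own + node.children.iter().map(score).sum::<f64>()
}

/// Pick the child subtree with the highest score as the "main content".
fn best_child(root: &Node) -> Option<&Node> {
    root.children
        .iter()
        .max_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
}

fn main() {
    let page = Node {
        class: "body",
        text_len: 0,
        children: vec![
            Node { class: "nav", text_len: 300, children: vec![] },
            Node { class: "article", text_len: 2_000, children: vec![] },
            Node { class: "footer", text_len: 150, children: vec![] },
        ],
    };
    let main_content = best_child(&page).unwrap();
    assert_eq!(main_content.class, "article");
    println!("main content: {}", main_content.class);
}
```

Scoring every subtree once from the leaves up means the extractor can later pick the best-scoring region in a single top-down pass, which is what makes the fully in-memory pipeline fast.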
## Roadmap

- Vertical slice — fetch, extract, score, convert to Markdown
- Unified error handling
- `Web2llmConfig` — idiomatic initialization
- Performance optimizations — bump-allocation and zero-copy traversal
- Batch fetch — parallel fetching across CPU cores
- Adaptive fetch — SPA detection and browser fallback
- Rate limiting — per-host throttling
- Token counting & semantic chunking
- Recursive spider with concurrent link queue
- MCP server — `web2llm-mcp`
- CLI — `web2llm-cli`