web2llm
Fetch any web page. Get clean, token-efficient Markdown. Ready for LLMs.
web2llm is a high-performance, modular Rust crate that fetches web pages, strips away computational noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.
Why web2llm?
Feeding raw HTML to an LLM is wasteful and noisy. A typical web page is 80% structural boilerplate — navigation, cookie banners, footers, tracking scripts — and only 20% actual content. web2llm inverts that ratio, giving your LLM only what matters.
Features
- Content-aware extraction — scores every element by text density, tag semantics, and link ratio to isolate the main article body
- Clean Markdown output — preserves headings, tables, code blocks, and inline links while discarding layout noise
- Token-efficient — output is designed to minimize token cost in downstream LLM calls
- Shared Headless Browser — single persistent Chromium instance for dynamic pages (requires the rendered feature)
- Adaptive fetch — automatic fallback to headless browser for JS-heavy SPAs
- Robots.txt compliance — respects crawl rules out of the box
- Performance optimized — zero-copy tree traversal, LTO, and minimal allocations
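To make the content-aware extraction idea concrete, here is a minimal sketch of how text density, tag semantics, and link ratio could be combined into one score. It is not web2llm's actual scorer: the `Candidate` struct, the weights in `tag_weight`, and the formula in `score` are all assumptions chosen for illustration.

```rust
/// Hypothetical summary of one candidate element (not a web2llm type).
#[derive(Debug, Clone)]
pub struct Candidate {
    pub tag: &'static str,
    /// Characters of visible text inside the element.
    pub text_len: usize,
    /// Characters of that text that sit inside <a> elements.
    pub link_text_len: usize,
    /// Total length of the element's HTML, markup included.
    pub markup_len: usize,
}

/// Assumed weights for tag semantics: content tags up, chrome tags down.
fn tag_weight(tag: &str) -> f64 {
    match tag {
        "article" | "main" => 2.0,
        "section" | "div" | "p" => 1.0,
        "nav" | "footer" | "aside" | "header" => 0.2,
        _ => 0.5,
    }
}

/// Illustrative score: long, dense, link-poor text in a semantic tag wins.
pub fn score(c: &Candidate) -> f64 {
    if c.text_len == 0 || c.markup_len == 0 {
        return 0.0;
    }
    let density = c.text_len as f64 / c.markup_len as f64;
    let link_ratio = (c.link_text_len as f64 / c.text_len as f64).min(1.0);
    c.text_len as f64 * density * (1.0 - link_ratio) * tag_weight(c.tag)
}

/// Returns the highest-scoring candidate, or None for an empty slice.
pub fn best_candidate(cands: &[Candidate]) -> Option<&Candidate> {
    cands.iter().max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

Under these assumed weights, a navigation bar made almost entirely of links scores near zero, while a long article body with few links dominates.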
Configuration & Features
rendered Feature Flag (Headless Browser)
By default, web2llm is lightweight and only performs static HTTP fetches. To support Single Page Applications (SPAs) or sites that require JavaScript rendering, enable the rendered feature:
[dependencies]
web2llm = { version = "0.2.0", features = ["rendered"] }
FetchMode Strategies
You can control how web2llm handles pages via the fetch_mode configuration:
- FetchMode::Static: (Default) Fast, standard HTTP request. No JavaScript execution.
- FetchMode::Dynamic: Uses a headless browser to render the page. Required for SPAs.
- FetchMode::Auto: Smart mode. Tries a fast static fetch first, detects if the page is an SPA shell, and automatically restarts using the headless browser only if needed.
let config = Web2llmConfig ;
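The decision that Auto mode makes can be sketched as follows. This is only an illustration of the idea, not the crate's detection logic: the `FetchMode` enum below is a local stand-in that mirrors the variant names above, and `visible_text_len`, `looks_like_spa_shell`, `needs_browser`, and the 200-character threshold are hypothetical.

```rust
/// Local stand-in mirroring the documented variants (not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum FetchMode {
    #[default]
    Static,
    Dynamic,
    Auto,
}

/// Rough count of visible, non-whitespace characters. Skips tags and the
/// bodies of <script>/<style> elements. Not a real HTML parser.
fn visible_text_len(html: &str) -> usize {
    let lower = html.to_ascii_lowercase();
    let mut rest = lower.as_str();
    let mut count = 0;
    while let Some(open) = rest.find('<') {
        count += rest[..open].chars().filter(|c| !c.is_whitespace()).count();
        let after = &rest[open..];
        // Skip whole script/style elements, otherwise just the tag itself.
        let skip_to = if after.starts_with("<script") {
            after.find("</script>").map(|i| i + "</script>".len())
        } else if after.starts_with("<style") {
            after.find("</style>").map(|i| i + "</style>".len())
        } else {
            after.find('>').map(|i| i + 1)
        };
        match skip_to {
            Some(n) => rest = &after[n..],
            None => return count, // unterminated tag: stop counting
        }
    }
    count + rest.chars().filter(|c| !c.is_whitespace()).count()
}

/// Assumed heuristic: scripts are present but almost no text was served.
pub fn looks_like_spa_shell(html: &str) -> bool {
    let has_scripts = html.to_ascii_lowercase().contains("<script");
    has_scripts && visible_text_len(html) < 200
}

/// Decides whether the headless browser is needed after a static fetch.
pub fn needs_browser(mode: FetchMode, static_html: &str) -> bool {
    match mode {
        FetchMode::Static => false,
        FetchMode::Dynamic => true,
        FetchMode::Auto => looks_like_spa_shell(static_html),
    }
}
```

A page that ships only an empty root element plus a script bundle would trigger the browser under Auto, while a server-rendered article would not. A real detector would likely look at more signals than text length.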
Architecture
The pipeline executes in 5 stages:
URL
│
▼
[1] Pre-flight — URL validation, robots.txt check, rate limiting
│
▼
[2] Fetch — Static fetch (reqwest) or Dynamic fallback (chromiumoxide)
│
▼
[3] Extract — Content scoring isolates main body, link discovery
│
▼
[4] Transform — HTML → clean Markdown
│
▼
[5] Output — PageResult struct, optional disk persistence
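The stages above can be sketched as a chain of small functions joined with the `?` operator. This is a toy model under stated assumptions, not the crate's internals: `PipelineError`, `preflight`, `extract`, `transform`, and `run` are hypothetical names, the fetch stage is injected as a closure so no network is needed, and the extract and transform steps are deliberately naive string operations. Only the `PageResult` field names (url, title, markdown, timestamp) come from this README.

```rust
use std::time::SystemTime;

/// Field names follow the README; the concrete types are assumptions.
#[derive(Debug)]
pub struct PageResult {
    pub url: String,
    pub title: String,
    pub markdown: String,
    pub timestamp: SystemTime,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum PipelineError {
    InvalidUrl(String),
    Fetch(String),
    EmptyContent,
}

/// [1] Pre-flight: only the URL scheme is checked here.
fn preflight(url: &str) -> Result<(), PipelineError> {
    if url.starts_with("http://") || url.starts_with("https://") {
        Ok(())
    } else {
        Err(PipelineError::InvalidUrl(url.to_string()))
    }
}

/// Returns the text between the first `open` and the following `close`.
fn between<'a>(s: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = s.find(open)? + open.len();
    let end = s[start..].find(close)? + start;
    Some(&s[start..end])
}

/// [3] Extract: naive stand-in that takes <title> and the first <article>.
fn extract(html: &str) -> Result<(String, String), PipelineError> {
    let title = between(html, "<title>", "</title>").unwrap_or("").to_string();
    let body = between(html, "<article>", "</article>").ok_or(PipelineError::EmptyContent)?;
    Ok((title, body.to_string()))
}

/// [4] Transform: handles only <h1> and <p>, to show the shape of the step.
fn transform(body: &str) -> String {
    body.replace("<h1>", "# ")
        .replace("</h1>", "\n\n")
        .replace("<p>", "")
        .replace("</p>", "\n\n")
        .trim()
        .to_string()
}

/// Runs stages [1] to [5]; `fetch` plays the role of stage [2].
pub fn run<F>(url: &str, fetch: F) -> Result<PageResult, PipelineError>
where
    F: Fn(&str) -> Result<String, String>,
{
    preflight(url)?;
    let html = fetch(url).map_err(PipelineError::Fetch)?;
    let (title, body) = extract(&html)?;
    Ok(PageResult {
        url: url.to_string(),
        title,
        markdown: transform(&body),
        timestamp: SystemTime::now(),
    })
}
```

Passing the fetcher as a closure keeps the sketch testable offline and mirrors the static/dynamic split: either backend only has to turn a URL into HTML.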
Quick Start
[dependencies]
web2llm = "0.2.0"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
Simple Fetch (Static)
use fetch;
async
Dynamic Fetch (SPA Support)
Enable the rendered feature to support JavaScript-heavy sites:
[dependencies]
web2llm = { version = "0.2.0", features = ["rendered"] }
use ;
async
Roadmap
- Vertical slice — fetch, extract, score, convert to Markdown
- Unified error handling
- PageResult output struct with url, title, markdown, and timestamp
- Web2llmConfig — user-facing configuration struct (idiomatic initialization)
- Pre-flight — URL validation and robots.txt compliance
- Performance optimizations — zero-copy traversal and shared browser
- Batch fetch — fetch multiple URLs concurrently
- Adaptive fetch — SPA detection and headless browser fallback
- Rate limiting — per-host request throttling
- Token counting
- Semantic chunking
- Recursive spider with concurrent link queue
- MCP server — web2llm-mcp
- CLI — web2llm-cli