# web2llm
Fetch any web page. Get clean, token-efficient Markdown. Ready for LLMs.
web2llm is a high-performance, modular Rust crate that fetches web pages, strips away structural noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.
## Why web2llm?
Feeding raw HTML to an LLM is wasteful and noisy. A typical web page is 80% structural boilerplate — navigation, cookie banners, footers, tracking scripts — and only 20% actual content. web2llm inverts that ratio, giving your LLM only what matters.
## Features
- Content-aware extraction — scores every element by text density, tag semantics, and link ratio to isolate the main article body
- Clean Markdown output — preserves headings, tables, code blocks, and inline links while discarding layout noise
- Token-efficient — output is designed to minimize token cost in downstream LLM calls
- Shared headless browser — single persistent Chromium instance for dynamic pages (requires the `rendered` feature)
- Adaptive fetch — automatic fallback to the headless browser for JS-heavy SPAs
- Robots.txt compliance — respects crawl rules out of the box
- Performance optimized — zero-copy tree traversal, LTO, and minimal allocations
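The content-aware extraction described above can be pictured as a scoring heuristic. The sketch below is illustrative only; the field names, tag weights, and formula are assumptions, not the crate's actual algorithm:

```rust
// Illustrative heuristic only: web2llm's real weights and fields may differ.
struct Element {
    tag: &'static str,
    text_len: usize,      // total text characters in the element
    link_text_len: usize, // characters that sit inside <a> tags
}

fn score(el: &Element) -> f64 {
    // Tag semantics: article-like tags get a boost, chrome-like tags a penalty.
    let tag_weight = match el.tag {
        "article" | "main" | "p" => 2.0,
        "nav" | "footer" | "aside" => 0.2,
        _ => 1.0,
    };
    // Link ratio: blocks that are mostly links (menus) score near zero.
    let link_ratio = if el.text_len == 0 {
        1.0
    } else {
        el.link_text_len as f64 / el.text_len as f64
    };
    tag_weight * el.text_len as f64 * (1.0 - link_ratio)
}

fn main() {
    let article = Element { tag: "p", text_len: 800, link_text_len: 40 };
    let navbar = Element { tag: "nav", text_len: 200, link_text_len: 190 };
    // Dense prose with few links far outscores a link-heavy nav block.
    println!("article: {:.0}, nav: {:.0}", score(&article), score(&navbar));
}
```

With scores like these, keeping only the top-scoring subtree discards navigation and footer blocks automatically.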
## Performance
web2llm is built for speed and high-throughput RAG pipelines.
| Task | Average Time | Throughput |
|---|---|---|
| Simple Page Extraction | < 1.0 ms | ~1,000+ pages/sec |
| Wikipedia (Large) Extraction | ~4.3 ms | ~230 pages/sec |
| Batch Fetch (100x Wikipedia) | ~103.7 ms | ~960+ pages/sec |
Benchmarks performed on an AMD Ryzen 7 5800X. Real-world performance may vary based on network latency.
Note: Batch fetch utilizes true parallelism via tokio::spawn, saturating CPU cores for parsing and scoring while managing I/O efficiently.
## Configuration & Features

### `rendered` Feature Flag (Headless Browser)

By default, web2llm is lightweight and only performs static HTTP fetches. To support Single Page Applications (SPAs) or sites that require JavaScript rendering, enable the `rendered` feature:

```toml
[dependencies]
web2llm = { version = "0.2.1", features = ["rendered"] }
```
### `FetchMode` Strategies

You can control how web2llm handles pages via the `fetch_mode` configuration:

- `FetchMode::Static` (default) — fast, standard HTTP request; no JavaScript execution.
- `FetchMode::Dynamic` — uses a headless browser to render the page. Required for SPAs.
- `FetchMode::Auto` — smart mode. Tries a fast static fetch first, detects whether the page is an SPA shell, and automatically retries with the headless browser only if needed.

The configuration struct is initialized as follows (the `..Default::default()` pattern is an assumption based on the "idiomatic initialization" note in the Roadmap):

```rust
let config = Web2llmConfig {
    fetch_mode: FetchMode::Auto,
    ..Default::default()
};
```
## Architecture

The pipeline executes in five stages:

```text
URL
 │
 ▼
[1] Pre-flight — URL validation, robots.txt check, rate limiting
 │
 ▼
[2] Fetch — Static fetch (reqwest) or Dynamic fallback (chromiumoxide)
 │
 ▼
[3] Extract — Content scoring isolates main body, link discovery
 │
 ▼
[4] Transform — HTML → clean Markdown
 │
 ▼
[5] Output — PageResult struct, optional disk persistence
```
## Quick Start

```toml
[dependencies]
web2llm = "0.2.1"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
### Simple Fetch (Static)

A minimal static fetch, assuming a top-level `fetch` helper that returns a `PageResult`; the exact entry-point signature may differ from the published API:

```rust
use web2llm::fetch;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Static HTTP fetch; the main content comes back as clean Markdown.
    let page = fetch("https://example.com").await?;
    println!("{}", page.markdown);
    Ok(())
}
```
### Dynamic Fetch (SPA Support)

Enable the `rendered` feature to support JavaScript-heavy sites:

```toml
[dependencies]
web2llm = { version = "0.2.1", features = ["rendered"] }
```

A minimal dynamic-fetch sketch; `fetch_with_config` is a hypothetical name for the config-driven entry point, so check the crate docs for the actual item:

```rust
use web2llm::{FetchMode, Web2llmConfig};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = Web2llmConfig {
        fetch_mode: FetchMode::Dynamic, // render with the headless browser
        ..Default::default()
    };
    // Hypothetical entry point used for illustration.
    let page = web2llm::fetch_with_config("https://spa.example.com", &config).await?;
    println!("{}", page.markdown);
    Ok(())
}
```
### Link Extraction

web2llm provides two ways to extract URLs from a page:

- `Web2llm::get_urls(url)` (raw) — fetches the page and returns every absolute link found in the original HTML document (includes nav, footers, etc.).
- `PageResult::get_urls()` (scored) — returns only the links found within the high-quality content blocks that survived the scoring process.
## Roadmap

- Vertical slice — fetch, extract, score, convert to Markdown
- Unified error handling
- `PageResult` output struct with url, title, markdown, and timestamp
- `Web2llmConfig` — user-facing configuration struct (idiomatic initialization)
- Pre-flight — URL validation and robots.txt compliance
- Performance optimizations — zero-copy traversal and shared browser
- Batch fetch — fetch multiple URLs concurrently
- Adaptive fetch — SPA detection and headless browser fallback
- Rate limiting — per-host request throttling
- Token counting
- Semantic chunking
- Recursive spider with concurrent link queue
- MCP server — `web2llm-mcp`
- CLI — `web2llm-cli`