web2llm 0.2.1

Fetch web pages and convert to clean Markdown for LLM pipelines

web2llm

Fetch any web page. Get clean, token-efficient Markdown. Ready for LLMs.

web2llm is a high-performance, modular Rust crate that fetches web pages, strips away noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.

Why web2llm?

Feeding raw HTML to an LLM is wasteful and noisy. A typical web page is roughly 80% structural boilerplate (navigation, cookie banners, footers, tracking scripts) and only about 20% actual content. web2llm inverts that ratio, giving your LLM only what matters.

Features

Content-aware extraction: scores every element by text density, tag semantics, and link ratio to isolate the main article body
Clean Markdown output: preserves headings, tables, code blocks, and inline links while discarding layout noise
Token-efficient: output is designed to minimize token cost in downstream LLM calls
Shared headless browser: a single persistent Chromium instance for dynamic pages (requires the `rendered` feature)
Adaptive fetch: automatic fallback to the headless browser for JS-heavy SPAs
Robots.txt compliance: respects crawl rules out of the box
Performance-optimized: zero-copy tree traversal, LTO, and minimal allocations
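
To make the content-aware extraction idea concrete, here is a minimal sketch of how text density, tag semantics, and link ratio could be combined into one score. It is not the crate's actual scoring code: the `Candidate` struct, the weights in `tag_weight`, and the formula in `score` are all assumptions chosen for illustration.

```rust
/// A simplified view of one candidate HTML element (hypothetical type,
/// not part of the web2llm API).
#[derive(Debug, Clone)]
pub struct Candidate {
    pub tag: &'static str,
    /// Characters of visible text inside the element.
    pub text_len: usize,
    /// Characters of that text that sit inside <a> tags.
    pub link_text_len: usize,
    /// Total characters of markup plus text for the element.
    pub html_len: usize,
}

/// Bonus or penalty based on tag semantics (illustrative weights).
fn tag_weight(tag: &str) -> f64 {
    match tag {
        "article" | "main" => 2.0,
        "section" | "p" | "pre" | "blockquote" => 1.0,
        "div" => 0.5,
        "nav" | "footer" | "aside" | "header" | "form" => -2.0,
        _ => 0.0,
    }
}

/// Assumed formula:
/// text density * (1 - link ratio) * ln(1 + text length) + tag weight.
pub fn score(c: &Candidate) -> f64 {
    if c.html_len == 0 || c.text_len == 0 {
        // Elements without text never score above zero.
        return tag_weight(c.tag).min(0.0);
    }
    let density = c.text_len as f64 / c.html_len as f64;
    let link_ratio = c.link_text_len as f64 / c.text_len as f64;
    density * (1.0 - link_ratio) * (1.0 + c.text_len as f64).ln() + tag_weight(c.tag)
}

/// Pick the highest-scoring candidate, if any.
pub fn best(candidates: &[Candidate]) -> Option<&Candidate> {
    candidates
        .iter()
        .max_by(|a, b| score(a).total_cmp(&score(b)))
}
```

With these weights, a long `article` element with few links scores well above a `nav` element whose text is almost entirely link anchors, which is the behavior the feature list describes.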

Performance

web2llm is built for extreme speed and high-throughput RAG pipelines.

| Task | Average Time | Throughput |
|---|---|---|
| Simple Page Extraction | < 1.0 ms | ~1,000+ pages/sec |
| Wikipedia (Large) Extraction | ~4.3 ms | ~230 pages/sec |
| Batch Fetch (100x Wikipedia) | ~103.7 ms | ~960+ pages/sec |

Benchmarks performed on an AMD Ryzen 7 5800X. Real-world performance may vary based on network latency.

Note: Batch fetch achieves true parallelism via tokio::spawn: parsing and scoring are spread across CPU cores while network I/O remains asynchronous.
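
The fan-out pattern behind the batch numbers can be sketched without any dependencies. The example below swaps tokio::spawn for scoped threads from the standard library, so it illustrates only the CPU-bound half (parsing and scoring) and not the crate's async I/O. The `extract` function is a placeholder that counts words, and `process_batch` is a hypothetical name.

```rust
use std::thread;

/// Placeholder for the CPU-bound extract step: counts whitespace-separated words.
fn extract(html: &str) -> usize {
    html.split_whitespace().count()
}

/// Split the batch into one chunk per available core, process the chunks on
/// scoped threads, and return the results in input order.
pub fn process_batch(pages: &[String]) -> Vec<usize> {
    let workers = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Ceiling division; at least 1 so `chunks` never receives zero.
    let chunk_size = ((pages.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        // Spawn every worker first, then join, so the chunks run concurrently.
        let handles: Vec<_> = pages
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|p| extract(p)).collect::<Vec<usize>>())
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because chunks are joined in the order they were spawned, results line up with the input slice even though the work itself runs concurrently.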

Configuration & Features

`rendered` Feature Flag (Headless Browser)

By default, web2llm is lightweight and only performs static HTTP fetches. To support Single Page Applications (SPAs) or sites that require JavaScript rendering, enable the `rendered` feature:

[dependencies]
web2llm = { version = "0.2.1", features = ["rendered"] }

`FetchMode` Strategies

You can control how web2llm handles pages via the fetch_mode configuration:

FetchMode::Static: (Default) Fast, standard HTTP request. No JavaScript execution.
FetchMode::Dynamic: Uses a headless browser to render the page. Required for SPAs.
FetchMode::Auto: Tries a fast static fetch first, checks whether the response is an SPA shell, and retries with the headless browser only if needed.

let config = Web2llmConfig {
    fetch_mode: FetchMode::Auto,
    ..Default::default()
};
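
The README does not say how an SPA shell is detected. One plausible heuristic, sketched below, is to treat a page as a shell when it ships scripts but almost no visible text. The function names and the 200-character threshold are assumptions for illustration, not the crate's implementation.

```rust
/// Illustrative threshold: pages with fewer visible characters than this,
/// and at least one script, are treated as SPA shells.
const MIN_TEXT_CHARS: usize = 200;

/// Count non-whitespace characters outside of tags, skipping the bodies of
/// <script> and <style> elements. Expects lowercased HTML.
fn visible_text_len(lower_html: &str) -> usize {
    let non_ws = |s: &str| s.chars().filter(|c| !c.is_whitespace()).count();
    let mut count = 0;
    let mut rest = lower_html;

    while let Some(open) = rest.find('<') {
        count += non_ws(&rest[..open]);
        let after = &rest[open..];
        // Decide how far to skip: whole element for script/style, else one tag.
        let skip_to = if after.starts_with("<script") {
            after.find("</script>").map(|i| i + "</script>".len())
        } else if after.starts_with("<style") {
            after.find("</style>").map(|i| i + "</style>".len())
        } else {
            after.find('>').map(|i| i + 1)
        };
        match skip_to {
            Some(n) => rest = &after[n..],
            None => return count, // unterminated tag: stop scanning
        }
    }
    count + non_ws(rest)
}

/// Heuristic: the page loads scripts but carries almost no readable text.
pub fn looks_like_spa_shell(html: &str) -> bool {
    let lower = html.to_ascii_lowercase();
    let scripts = lower.matches("<script").count();
    scripts > 0 && visible_text_len(&lower) < MIN_TEXT_CHARS
}
```

Under this sketch, an Auto-style strategy would run the static fetch, call `looks_like_spa_shell` on the response body, and fall back to the browser only when it returns true.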

Architecture

The pipeline executes in 5 stages:

URL
 │
 ▼
[1] Pre-flight       — URL validation, robots.txt check, rate limiting
 │
 ▼
[2] Fetch            — Static fetch (reqwest) or Dynamic fallback (chromiumoxide)
 │
 ▼
[3] Extract          — Content scoring isolates main body, link discovery
 │
 ▼
[4] Transform        — HTML → clean Markdown
 │
 ▼
[5] Output           — PageResult struct, optional disk persistence
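
As an illustration of the pre-flight stage, the sketch below checks a path against the `User-agent: *` group of a robots.txt file using longest-prefix matching. It is a simplified stand-in, not the crate's code: the `is_allowed` function is hypothetical, and wildcard patterns (`*`, `$`) and agent-specific groups are deliberately ignored.

```rust
/// Returns true if `path` may be fetched according to the `User-agent: *`
/// group of `robots_txt`. Simplified: plain prefix rules only.
pub fn is_allowed(robots_txt: &str, path: &str) -> bool {
    let mut in_star_group = false;
    let mut last_was_agent = false;
    // Longest matching rule wins: (prefix length, allowed).
    let mut best: Option<(usize, bool)> = None;

    for line in robots_txt.lines() {
        // Strip comments and surrounding whitespace.
        let line = line.split('#').next().unwrap_or("").trim();
        let Some((key, value)) = line.split_once(':') else {
            continue;
        };
        let (key, value) = (key.trim().to_ascii_lowercase(), value.trim());

        match key.as_str() {
            "user-agent" => {
                // Consecutive user-agent lines share one group.
                if !last_was_agent {
                    in_star_group = false;
                }
                if value == "*" {
                    in_star_group = true;
                }
                last_was_agent = true;
            }
            "allow" | "disallow" => {
                last_was_agent = false;
                if in_star_group && !value.is_empty() && path.starts_with(value) {
                    let allowed = key == "allow";
                    let better = match best {
                        None => true,
                        // Prefer longer prefixes; on a tie, prefer Allow.
                        Some((len, was_allowed)) => {
                            value.len() > len || (value.len() == len && allowed && !was_allowed)
                        }
                    };
                    if better {
                        best = Some((value.len(), allowed));
                    }
                }
            }
            _ => last_was_agent = false,
        }
    }

    // No matching rule means the path is allowed.
    best.map_or(true, |(_, allowed)| allowed)
}
```

A pre-flight step built on this would fetch `/robots.txt` once per host, cache the body, and call `is_allowed` before each page request.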

Quick Start

[dependencies]
web2llm = "0.2.1"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }

Simple Fetch (Static)

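use web2llm::fetch;

#[tokio::main]
async fn main() {
    // Fetch with the default configuration (static HTTP, no JavaScript).
    let result = fetch("https://example.com".to_string()).await.unwrap();
    // `markdown` holds the extracted content as clean Markdown.
    println!("{}", result.markdown);
}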

Dynamic Fetch (SPA Support)

Enable the `rendered` feature to support JavaScript-heavy sites:

[dependencies]
web2llm = { version = "0.2.1", features = ["rendered"] }

use web2llm::{Web2llm, Web2llmConfig, FetchMode};

#[tokio::main]
async fn main() {
    let config = Web2llmConfig {
        fetch_mode: FetchMode::Auto, // Automatically use browser if SPA is detected
        ..Default::default()
    };
    
    let client = Web2llm::new(config).unwrap();
    let result = client.fetch("https://reddit.com").await.unwrap();
    println!("{}", result.markdown);

    // Extract links found in the scored content
    let links = result.get_urls();
}

Link Extraction

web2llm provides two ways to extract URLs from a page:

Web2llm::get_urls(url): (Raw) Fetches the page and returns every absolute link found in the original HTML document, including those in navigation, footers, and other boilerplate.
PageResult::get_urls(): (Scored) Returns only the links found within the content blocks that survived scoring.
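
The difference between the two modes comes down to which HTML the link scan runs over. The sketch below uses a hypothetical `absolute_links` helper built on plain string scanning, not the crate's parser. Relative links are simply skipped here, whereas the crate may resolve them against the page URL.

```rust
/// Collect unique absolute http(s) links from `href="..."` attributes,
/// in document order. Hypothetical helper, not part of the web2llm API.
pub fn absolute_links(html: &str) -> Vec<String> {
    const ATTR: &str = "href=\"";
    let mut links: Vec<String> = Vec::new();
    let mut rest = html;

    while let Some(pos) = rest.find(ATTR) {
        let after = &rest[pos + ATTR.len()..];
        // Stop if the attribute value is never closed.
        let Some(end) = after.find('"') else {
            break;
        };
        let href = &after[..end];
        let is_absolute = href.starts_with("http://") || href.starts_with("https://");
        if is_absolute && !links.iter().any(|l| l == href) {
            links.push(href.to_string());
        }
        rest = &after[end + 1..];
    }
    links
}
```

Running `absolute_links` over the full document mirrors the raw mode, while running it over only the content that survived scoring mirrors the scored mode, which is why navigation and footer links drop out of the second result.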

Roadmap

Vertical slice — fetch, extract, score, convert to Markdown
Unified error handling
PageResult output struct with url, title, markdown, and timestamp
Web2llmConfig — user-facing configuration struct (idiomatic initialization)
Pre-flight — URL validation and robots.txt compliance
Performance optimizations — zero-copy traversal and shared browser
Batch fetch — fetch multiple URLs concurrently
Adaptive fetch — SPA detection and headless browser fallback
Rate limiting — per-host request throttling
Token counting
Semantic chunking
Recursive spider with concurrent link queue
MCP server — web2llm-mcp
CLI — web2llm-cli