# web2llm
Fetch any web page. Get clean Markdown. Ready for LLMs.
web2llm is a high-performance, modular Rust crate that fetches web pages, strips away noise (ads, navbars, footers, scripts), and converts the core content into clean Markdown optimized for Large Language Model (LLM) ingestion and Retrieval-Augmented Generation (RAG) pipelines.
## Quick Start

Add this to your `Cargo.toml`:

```toml
[dependencies]
web2llm = "0.3.0"
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
Fetch and print Markdown in one call:

```rust
use web2llm::fetch;

#[tokio::main]
async fn main() {
    // Field names below are illustrative of the public API.
    let result = fetch("https://example.com").await.unwrap();
    println!("{}", result.markdown);
}
```
## Features
- Content-aware extraction — isolates the main article body with extreme precision.
- Clean Markdown output — preserves headings, tables, code blocks, and inline links.
- Adaptive fetch — automatic fallback to headless browser for JS-heavy SPAs.
- High Performance — zero-copy traversal and bump-allocation (~3.9ms for Wikipedia).
- Semantic Chunking — divide content into logical, token-budgeted islands for AI apps.
## Configuration & Fetch Strategies

You can control how web2llm handles pages via the `FetchMode` configuration:
- `FetchMode::Static`: Fast, standard HTTP request. No JavaScript execution.
- `FetchMode::Dynamic`: Uses a headless browser to render the page. Required for SPAs.
- `FetchMode::Auto`: (Default) Smart mode. Tries a fast static fetch first, detects if the page is an SPA shell, and automatically retries with the browser only if needed.
```rust
use web2llm::{FetchMode, Web2llmConfig};

// Field names are illustrative of the public API; see the crate docs.
let config = Web2llmConfig {
    fetch_mode: FetchMode::Dynamic,
    ..Default::default()
};
```
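The SPA-shell detection behind `FetchMode::Auto` can be approximated with a simple heuristic. This standalone sketch (not the crate's actual implementation; the function name and threshold are illustrative) flags pages whose `<body>` contains almost no visible text once `<script>` sections and tags are removed:

```rust
/// Heuristic SPA-shell check: a static fetch that returns a nearly empty
/// <body>, or one dominated by <script> tags, likely needs a headless
/// browser render. Simplified sketch, not web2llm's real detection logic.
fn looks_like_spa_shell(html: &str) -> bool {
    let lower = html.to_lowercase();
    // Locate the raw body content, if any.
    let body = match (lower.find("<body"), lower.rfind("</body>")) {
        (Some(start), Some(end)) if start < end => {
            match lower[start..].find('>').map(|i| start + i + 1) {
                Some(s) if s < end => &lower[s..end],
                _ => return true, // malformed <body> tag: treat as shell
            }
        }
        _ => return true, // no body at all
    };
    // Strip <script>...</script> sections.
    let mut visible = String::new();
    let mut rest = body;
    while let Some(open) = rest.find("<script") {
        visible.push_str(&rest[..open]);
        match rest[open..].find("</script>") {
            Some(close) => rest = &rest[open + close + "</script>".len()..],
            None => { rest = ""; break; }
        }
    }
    visible.push_str(rest);
    // Drop remaining tags and whitespace; a real page has plenty of text left.
    let mut text = String::new();
    let mut in_tag = false;
    for c in visible.chars() {
        match c {
            '<' => in_tag = true,
            '>' => in_tag = false,
            c if !in_tag && !c.is_whitespace() => text.push(c),
            _ => {}
        }
    }
    text.len() < 200 // tunable threshold
}

fn main() {
    let shell = r#"<html><body><div id="root"></div><script src="app.js"></script></body></html>"#;
    let article = format!("<html><body><p>{}</p></body></html>", "real content ".repeat(50));
    assert!(looks_like_spa_shell(shell));
    assert!(!looks_like_spa_shell(&article));
    println!("shell detected, article passed");
}
```

A production detector would also weigh signals like known framework markers and content-length headers; the point is only that the static result itself tells you whether a browser pass is worth the cost.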
## Lightweight Build (Optional)

web2llm includes Chromium support by default for a "plug-and-play" experience. Power users who only need static scraping can disable default features to remove the Chromium dependency (~50 sub-dependencies):

```toml
[dependencies]
web2llm = { version = "0.3.0", default-features = false }
```
## Performance
web2llm is built for extreme speed and high-throughput ingestion.
Note: Metrics represent pure extraction and processing throughput, excluding network latency.
| Task | Average Time | Throughput |
|---|---|---|
| Simple Page Extraction | < 1.0 ms | ~1,000+ pages/sec |
| Wikipedia (Large) Extraction | ~4.0 ms | ~250 pages/sec |
| Batch Fetch (100x Wikipedia) | ~100 ms | ~1,000 pages/sec |
*Speed may vary across systems.*
## Advanced: Semantic Chunking
For "true AI" applications and RAG pipelines, web2llm can divide documents into logical, structurally-aware chunks that fit your token budget without splitting paragraphs mid-sentence.
```rust
use web2llm::{Web2llmClient, Web2llmConfig};

// Type, field, and method names are illustrative of the public API.
let config = Web2llmConfig {
    max_chunk_tokens: 512,
    ..Default::default()
};
let client = Web2llmClient::new(config).unwrap();
let result = client.fetch("https://example.com").await.unwrap();

// Access granular chunks for precise vector embedding
for chunk in result.chunks {
    println!("{}", chunk.markdown);
}
```
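Conceptually, token-budgeted chunking packs whole structural units into chunks rather than cutting at a fixed character offset. This standalone sketch (not web2llm's implementation; it uses whitespace-delimited words as stand-in "tokens" and blank lines as paragraph boundaries) shows the invariant that no paragraph is split:

```rust
/// Simplified token-budgeted chunker: split Markdown on blank lines and
/// greedily pack paragraphs into chunks that stay under `budget` tokens.
/// web2llm's real chunker is structure-aware (headings, lists, tables);
/// this only illustrates the "never split a paragraph" rule.
fn chunk_by_paragraphs(markdown: &str, budget: usize) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0;
    for para in markdown.split("\n\n").filter(|p| !p.trim().is_empty()) {
        let tokens = para.split_whitespace().count();
        // Flush the current chunk if this paragraph would bust the budget.
        if current_tokens > 0 && current_tokens + tokens > budget {
            chunks.push(current.trim_end().to_string());
            current.clear();
            current_tokens = 0;
        }
        current.push_str(para.trim());
        current.push_str("\n\n");
        current_tokens += tokens;
    }
    if current_tokens > 0 {
        chunks.push(current.trim_end().to_string());
    }
    chunks
}

fn main() {
    let doc = "# Title\n\nfirst paragraph with five tokens here\n\nsecond paragraph also has several tokens\n\nshort one";
    let chunks = chunk_by_paragraphs(doc, 10);
    // No chunk exceeds the budget unless a single paragraph does.
    assert!(chunks.iter().all(|c| c.split_whitespace().count() <= 10));
    println!("{} chunks produced", chunks.len());
}
```

A production chunker would measure real tokenizer counts (not words) and recurse into oversized sections, but the greedy pack-and-flush loop is the core idea.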
## Architecture
The pipeline executes in 5 stages:
```text
URL
 │
 ▼
[1] Pre-flight — URL validation, robots.txt check, rate limiting
 │
 ▼
[2] Fetch — Static fetch (reqwest) or Dynamic fallback (chromiumoxide)
 │
 ▼
[3] Score — Bottom-up recursive scoring builds a "Scored Tree" (Bump-allocated)
 │
 ▼
[4] Chunk & Wash — Top-down "Flatten or Recurse" chunking + Markdown optimization
 │
 ▼
[5] Output — PageResult struct containing Vec<PageChunk>
```
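The bottom-up scoring in stage [3] can be illustrated with a toy recursive scorer over a DOM-like tree. This is purely a sketch of the idea (the node shape, class-name penalties, and weights are invented for illustration), not web2llm's bump-allocated implementation:

```rust
/// Toy bottom-up scorer: each node's score is its own text length,
/// penalized for "noisy" class names, plus its children's scores.
struct Node {
    class: &'static str,
    text_len: usize,
    children: Vec<Node>,
}

fn score(node: &Node) -> f64 {
    let noisy = ["nav", "footer", "ad", "sidebar"]
        .iter()
        .any(|n| node.class.contains(n));
    let own = node.text_len as f64 * if noisy { 0.1 } else { 1.0 };
    own + node.children.iter().map(score).sum::<f64>()
}

/// Pick the child subtree with the highest score as the "main content".
fn best_child(root: &Node) -> Option<&Node> {
    root.children
        .iter()
        .max_by(|a, b| score(a).partial_cmp(&score(b)).unwrap())
}

fn main() {
    let page = Node {
        class: "body",
        text_len: 0,
        children: vec![
            Node { class: "nav", text_len: 300, children: vec![] },
            Node { class: "article", text_len: 2_000, children: vec![] },
            Node { class: "footer", text_len: 150, children: vec![] },
        ],
    };
    let main_content = best_child(&page).unwrap();
    assert_eq!(main_content.class, "article");
    println!("main content: {}", main_content.class);
}
```

Scoring every subtree once from the leaves up means the extractor can later pick the best-scoring region in a single top-down pass, which is what makes the fully in-memory pipeline fast.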
## Roadmap

- Vertical slice — fetch, extract, score, convert to Markdown
- Unified error handling
- `Web2llmConfig` — idiomatic initialization
- Performance optimizations — bump-allocation and zero-copy traversal
- Batch fetch — parallel fetching across CPU cores
- Adaptive fetch — SPA detection and browser fallback
- Rate limiting — per-host throttling
- Token counting & semantic chunking
- Recursive spider with concurrent link queue
- MCP server — `web2llm-mcp`
- CLI — `web2llm-cli`