
Crate web2llm


§web2llm

web2llm is a high-performance Rust crate designed to fetch web pages and convert their core content into clean, token-efficient Markdown. It’s optimized for feeding data into Large Language Models (LLMs) and RAG pipelines.

§Key Features

  • High Performance: Zero-copy tree traversal, LTO, and efficient scoring.
  • Clean Output: Strips navigation, headers, footers, and non-essential attributes.
  • Shared Browser: Single persistent headless Chromium instance for dynamic pages (requires the rendered Cargo feature).
  • Adaptive Fetch: Automatically detects SPAs and uses a browser fallback for full rendering.
  • SSRF Protection: Validates URLs and blocks private host access by default.
  • Robots.txt Compliance: Optionally respects robots.txt rules.
  • Rate Limiting: Built-in support for throttling and concurrency control.
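
The SSRF Protection item does not say how the crate decides that a host is private. As a rough illustration of the general idea only, a check for literal private addresses could look like the sketch below. The names is_blocked_ip and is_blocked_host are hypothetical and are not part of the web2llm API; the sketch uses only the standard library and may differ from what the crate actually does.

```rust
use std::net::IpAddr;

/// Hypothetical helper (not web2llm API): true when `ip` is in a range
/// that an SSRF guard would typically refuse to contact.
fn is_blocked_ip(ip: IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => {
            v4.is_loopback()          // 127.0.0.0/8
                || v4.is_private()    // 10/8, 172.16/12, 192.168/16
                || v4.is_link_local() // 169.254.0.0/16
                || v4.is_unspecified() // 0.0.0.0
                || v4.is_broadcast()  // 255.255.255.255
        }
        IpAddr::V6(v6) => {
            let first = v6.segments()[0];
            v6.is_loopback()                   // ::1
                || v6.is_unspecified()         // ::
                || (first & 0xfe00) == 0xfc00  // fc00::/7 unique local
                || (first & 0xffc0) == 0xfe80  // fe80::/10 link local
        }
    }
}

/// Hypothetical helper (not web2llm API): checks the host part of a URL.
/// Only literal IPs and "localhost" are handled here. Ordinary hostnames
/// pass through, so a real guard would also need to check the addresses
/// they resolve to. IPv4-mapped IPv6 addresses are not handled.
fn is_blocked_host(host: &str) -> bool {
    // URLs wrap IPv6 literals in brackets, e.g. "[::1]".
    let host = host.trim_start_matches('[').trim_end_matches(']');
    if host.eq_ignore_ascii_case("localhost") {
        return true;
    }
    match host.parse::<IpAddr>() {
        Ok(ip) => is_blocked_ip(ip),
        Err(_) => false,
    }
}
```

With this sketch, is_blocked_host("192.168.1.1") and is_blocked_host("[::1]") return true, while is_blocked_host("example.com") returns false because hostnames are not resolved here.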

§Quick Start

The easiest way to get started is to use the convenience fetch function:

use web2llm::fetch;

#[tokio::main]
async fn main() {
    // Fetch a page with default configuration
    match fetch("https://example.com".to_string()).await {
        Ok(result) => {
            println!("Title: {}", result.title);
            println!("Markdown content:\n{}", result.markdown());
        }
        Err(e) => eprintln!("Error: {}", e),
    }
}

For more control, use the Web2llm struct with a custom Web2llmConfig.

Re-exports§

pub use config::Web2llmConfig;
pub use error::Web2llmError;
pub use output::PageResult;

Modules§

config
error
output

Structs§

Web2llm
The main entry point for the web2llm pipeline.

Enums§

FetchMode
Defines the strategy used to fetch a page.

Functions§

batch_fetch
Convenience function — fetches multiple urls using Web2llmConfig::default.
fetch
Convenience function — fetches url using Web2llmConfig::default.