Crate webshift

§webshift

Denoised web search library for AI agents. Webshift fetches, cleans, reranks, and budget-caps web content so that LLM pipelines receive high-signal context without flooding their context windows. Every code path enforces hard limits on download size, per-page character count, and total query budget.

§Feature flags

Feature	Default	Enables
`backends`	on	All 8 search backends (SearXNG, Brave, Tavily, Exa, SerpAPI, Google, Bing, HTTP) and the `query()` pipeline
`llm`	off	OpenAI-compatible LLM client, query expansion, summarization, and LLM-assisted reranking

Minimal dependency (cleaner + fetcher only):

webshift = { version = "0.2", default-features = false }

§Use cases

§HTML cleaning only (`default-features = false`)

Synchronous, zero-network, zero-config HTML-to-text conversion:

let result = webshift::clean("<html><body><nav>menu</nav><p>Hello world</p></body></html>", 8000);
assert!(result.text.contains("Hello world"));
assert!(!result.text.contains("menu")); // noise removed

§Fetch and clean a single page

let config = webshift::Config::default();
let result = webshift::fetch("https://example.com", &config).await?;
println!("title: {}", result.title);
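// Take the first 100 characters; byte slicing (`[..100]`) panics on short text or mid-character boundaries.
println!("text:  {}...", result.text.chars().take(100).collect::<String>());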

§Full search pipeline (requires `backends`)

let config = webshift::Config::load()?;
let result = webshift::query(&["rust async runtime"], &config).await?;
for source in &result.sources {
    println!("[{}] {} — {}", source.id, source.title, source.url);
}

§Anti-flooding protections

max_download_mb: streaming cap per page (never buffers the full response)
max_result_length: hard character cap per cleaned page
max_query_budget: total character budget across all sources
max_total_results: hard cap on results per call
Binary extension filter runs before any network request
Unicode/BiDi sterilization in the cleaner
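
The interaction between the per-page cap and the total budget can be sketched as follows. This is a minimal illustration, not webshift's actual implementation: the helper names `cap_chars` and `apply_budget` are hypothetical, and the parameter names are borrowed from the `max_result_length` and `max_query_budget` settings listed above. The sketch assumes the budget is consumed in source order and that limits are counted in Unicode scalar values rather than bytes.

```rust
/// Hypothetical helper: truncate `text` to at most `max_chars` characters
/// without ever splitting a multi-byte code point.
fn cap_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        // `byte_idx` is the start of the first character past the limit.
        Some((byte_idx, _)) => &text[..byte_idx],
        // Fewer than `max_chars` + 1 characters: nothing to cut.
        None => text,
    }
}

/// Hypothetical helper: apply a per-page cap, then a shared total budget.
/// Pages are processed in order; once the budget is spent, later pages are dropped.
fn apply_budget(
    pages: &[&str],
    max_result_length: usize,
    max_query_budget: usize,
) -> Vec<String> {
    let mut remaining = max_query_budget;
    let mut out = Vec::new();
    for page in pages {
        if remaining == 0 {
            break;
        }
        // A page may use at most its own cap, and never more than what is left.
        let allowance = max_result_length.min(remaining);
        let capped = cap_chars(page, allowance);
        remaining -= capped.chars().count();
        out.push(capped.to_string());
    }
    out
}
```

Under these assumptions, calling `apply_budget` on three 10-character pages with a per-page cap of 8 and a total budget of 12 keeps 8 characters of the first page and 4 of the second, and drops the third entirely.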

§Re-exports

pub use config::Config;

§Modules

backends (requires `backends`): Search backend trait and implementations.
config: Configuration system: CLI args > env vars > webshift.toml > defaults.
llm (requires `llm`): LLM integration: OpenAI-compatible client, query expansion, summarization.
scraper: HTML fetching and cleaning pipeline.
utils: Utility modules: URL handling, reranking.

§Structs

CleanResult: Result of cleaning raw HTML into LLM-ready plain text.
FetchResult: Result of fetching and cleaning a single page.
QueryResult: Result of a full search query pipeline.
SnippetEntry: A snippet-only entry from the oversampling reserve pool.
Source: A single source in a query result.
Stats: Statistics for a query execution.
TextMap (requires `text-map`): Result of `extract_text_nodes`: an ordered list of content text nodes with the page title.
TextNode (requires `text-map`): A single text node extracted from HTML.
TextReplacement (requires `text-map`): A replacement instruction: change the text of a specific `TextNode` by its id.

§Enums

WebshiftError: Top-level error type for the webshift library.

§Functions

clean: Clean raw HTML into LLM-ready plain text.
extract_text_nodes (requires `text-map`): Extract text nodes from HTML, skipping noise elements.
fetch: Fetch and clean a single web page.
query (requires `backends`): Execute a full search query pipeline.
query_with_options (requires `backends`): Full query pipeline with optional overrides.
replace_text_nodes (requires `text-map`): Rebuild HTML with replaced text nodes.
