Crate webshift

§webshift

Denoised web search library for AI agents. Webshift fetches, cleans, reranks, and budget-caps web content so that LLM pipelines receive high-signal context without flooding their context windows. Every code path enforces hard limits on download size, per-page character count, and total query budget.

§Feature flags

Feature   Default  Enables
backends  on       All 8 search backends (SearXNG, Brave, Tavily, Exa, SerpAPI, Google, Bing, HTTP) and the query() pipeline
llm       off      OpenAI-compatible LLM client, query expansion, summarization, and LLM-assisted reranking

Minimal dependency (cleaner + fetcher only):

webshift = { version = "0.2", default-features = false }

§Use cases

§HTML cleaning only (default-features = false)

Synchronous, zero-network, zero-config HTML-to-text conversion:

let result = webshift::clean("<html><body><nav>menu</nav><p>Hello world</p></body></html>", 8000);
assert!(result.text.contains("Hello world"));
assert!(!result.text.contains("menu")); // noise removed

§Fetch and clean a single page

let config = webshift::Config::default();
let result = webshift::fetch("https://example.com", &config).await?;
println!("title: {}", result.title);
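// Take the first 100 characters. Byte slicing (`&result.text[..100]`) would
// panic on text shorter than 100 bytes or when index 100 splits a character.
let preview: String = result.text.chars().take(100).collect();
println!("text:  {}...", preview);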

§Full search pipeline (requires backends)

let config = webshift::Config::load()?;
let result = webshift::query(&["rust async runtime"], &config).await?;
for source in &result.sources {
    println!("[{}] {} — {}", source.id, source.title, source.url);
}

§Anti-flooding protections

  • max_download_mb: streaming cap per page (never buffers the full response)
  • max_result_length: hard character cap per cleaned page
  • max_query_budget: total character budget across all sources
  • max_total_results: hard cap on results per call
  • Binary extension filter runs before any network request
  • Unicode/BiDi sterilization in the cleaner
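The protections listed above can be sketched with the standard library alone. This is a minimal illustration of the ideas, not webshift's implementation: the function names (`read_capped`, `cap_chars`, `apply_budget`, `has_binary_extension`, `strip_bidi_controls`), the extension list, and the exact set of stripped code points are all assumptions made for the example.

```rust
use std::io::{self, Read};

/// Read at most `max_bytes` from `reader`; bytes past the cap are never pulled into memory.
fn read_capped<R: Read>(reader: R, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.take(max_bytes).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Truncate `text` to at most `max_chars` characters without splitting a character.
fn cap_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text, // already within the cap
    }
}

/// Apply the per-page cap first, then stop once the total budget is spent.
fn apply_budget(pages: &[&str], max_result_length: usize, max_query_budget: usize) -> Vec<String> {
    let mut remaining = max_query_budget;
    let mut out = Vec::new();
    for page in pages {
        if remaining == 0 {
            break;
        }
        let capped = cap_chars(page, max_result_length.min(remaining));
        remaining -= capped.chars().count();
        out.push(capped.to_string());
    }
    out
}

/// Decide from the URL alone whether to skip it, before any request is made.
/// The extension list here is hypothetical.
fn has_binary_extension(url: &str) -> bool {
    const BINARY_EXTENSIONS: [&str; 5] = ["pdf", "zip", "exe", "png", "mp4"];
    // Strip the scheme and host so a domain such as `example.zip` is not misread.
    let without_scheme = url.split_once("://").map_or(url, |(_, rest)| rest);
    let path = match without_scheme.split_once('/') {
        Some((_, path)) => path,
        None => return false, // no path component at all
    };
    // Ignore the query string and fragment.
    let path = path.split(|c: char| c == '?' || c == '#').next().unwrap_or("");
    let last_segment = path.rsplit('/').next().unwrap_or("");
    match last_segment.rsplit_once('.') {
        Some((_, ext)) => BINARY_EXTENSIONS.iter().any(|b| ext.eq_ignore_ascii_case(b)),
        None => false,
    }
}

/// Remove Unicode bidirectional control characters (marks, embeddings, overrides, isolates).
fn strip_bidi_controls(text: &str) -> String {
    text.chars()
        .filter(|c| {
            !matches!(
                *c,
                '\u{200E}' | '\u{200F}' | '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}'
            )
        })
        .collect()
}
```

In this sketch the caps count characters rather than bytes, matching the "character cap" wording above, and the extension check needs nothing but the URL string, which is why such a filter can run before any network request.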

Re-exports§

pub use config::Config;

Modules§

backends (feature: backends)
Search backend trait and implementations.
config
Configuration system: CLI args > env vars > webshift.toml > defaults.
llm (feature: llm)
LLM integration: OpenAI-compatible client, query expansion, summarization.
scraper
HTML fetching and cleaning pipeline.
utils
Utility modules: URL handling, reranking.
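
The precedence stated for the config module (CLI args > env vars > webshift.toml > defaults) amounts to taking the first layer that sets a value. A minimal sketch of that resolution order follows; the `Layer` and `Resolved` types and the `resolve` function are hypothetical and do not reflect the real `Config` API, and only two of the documented limits are modeled.

```rust
/// One configuration layer. `None` means the setting is absent at this layer.
#[derive(Default, Clone, Copy)]
struct Layer {
    max_result_length: Option<usize>,
    max_total_results: Option<usize>,
}

/// Fully resolved settings, with every field filled in.
struct Resolved {
    max_result_length: usize,
    max_total_results: usize,
}

/// Resolve each field independently: CLI wins over env, env over file, file over defaults.
fn resolve(cli: Layer, env: Layer, file: Layer, defaults: Resolved) -> Resolved {
    Resolved {
        max_result_length: cli
            .max_result_length
            .or(env.max_result_length)
            .or(file.max_result_length)
            .unwrap_or(defaults.max_result_length),
        max_total_results: cli
            .max_total_results
            .or(env.max_total_results)
            .or(file.max_total_results)
            .unwrap_or(defaults.max_total_results),
    }
}
```

Because each field is resolved on its own, a CLI flag that sets only one limit leaves the others to fall through to the environment, the file, or the defaults.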

Structs§

CleanResult
Result of cleaning raw HTML into LLM-ready plain text.
FetchResult
Result of fetching and cleaning a single page.
QueryResult
Result of a full search query pipeline.
SnippetEntry
A snippet-only entry from the oversampling reserve pool.
Source
A single source in a query result.
Stats
Statistics for a query execution.

Enums§

WebshiftError
Top-level error type for the webshift library.

Functions§

clean
Clean raw HTML into LLM-ready plain text.
fetch
Fetch and clean a single web page.
query (feature: backends)
Execute a full search query pipeline.
query_with_options (feature: backends)
Full query pipeline with optional overrides.