§webshift
Denoised web search library for AI agents. Webshift fetches, cleans, reranks, and budget-caps web content so that LLM pipelines receive high-signal context without flooding their context windows. Every code path enforces hard limits on download size, per-page character count, and total query budget.
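As a rough illustration of what a hard download limit can look like, the sketch below reads a body in fixed-size chunks and stops as soon as a byte cap is reached, so the remainder is never buffered. It uses only the standard library; `read_capped` and `CappedBody` are hypothetical names chosen for this example and are not taken from webshift's API, whose internals may differ.

```rust
use std::io::{self, Read};

/// Outcome of a capped read: the bytes kept and whether the cap cut the body short.
/// Hypothetical type for illustration only.
pub struct CappedBody {
    pub bytes: Vec<u8>,
    pub truncated: bool,
}

/// Reads from `source` in fixed-size chunks and stops once `max_bytes` have been kept.
/// Hypothetical helper; not part of webshift's public API.
pub fn read_capped<R: Read>(mut source: R, max_bytes: usize) -> io::Result<CappedBody> {
    let mut bytes = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = match source.read(&mut chunk) {
            Ok(n) => n,
            // A read interrupted by a signal is safe to retry.
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of stream reached without exceeding the cap.
            return Ok(CappedBody { bytes, truncated: false });
        }
        let remaining = max_bytes - bytes.len();
        if n > remaining {
            // Keep only what fits, then stop; the rest of the stream is never read.
            bytes.extend_from_slice(&chunk[..remaining]);
            return Ok(CappedBody { bytes, truncated: true });
        }
        bytes.extend_from_slice(&chunk[..n]);
    }
}
```

Any `Read` implementor works as the source, including a byte slice, which makes the cap easy to exercise without a network connection.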
§Feature flags
| Feature | Default | Enables |
|---|---|---|
| backends | on | All 8 search backends (SearXNG, Brave, Tavily, Exa, SerpAPI, Google, Bing, HTTP) and the query() pipeline |
| llm | off | OpenAI-compatible LLM client, query expansion, summarization, and LLM-assisted reranking |
Minimal dependency (cleaner + fetcher only):
webshift = { version = "0.2", default-features = false }
§Use cases
§HTML cleaning only (default-features = false)
Synchronous, zero-network, zero-config HTML-to-text conversion:
let result = webshift::clean("<html><body><nav>menu</nav><p>Hello world</p></body></html>", 8000);
assert!(result.text.contains("Hello world"));
assert!(!result.text.contains("menu")); // noise removed
§Fetch and clean a single page
let config = webshift::Config::default();
let result = webshift::fetch("https://example.com", &config).await?;
println!("title: {}", result.title);
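// Take the first 100 characters; slicing with [..100] would panic on shorter text
// or when byte 100 falls inside a multi-byte character.
println!("text: {}...", result.text.chars().take(100).collect::<String>());
§Full search pipeline (requires backends)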
let config = webshift::Config::load()?;
let result = webshift::query(&["rust async runtime"], &config).await?;
for source in &result.sources {
println!("[{}] {} — {}", source.id, source.title, source.url);
}
§Anti-flooding protections
- max_download_mb: streaming cap per page (never buffers the full response)
- max_result_length: hard character cap per cleaned page
- max_query_budget: total character budget across all sources
- max_total_results: hard cap on results per call
- Binary extension filter runs before any network request
- Unicode/BiDi sterilization in the cleaner
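The protections above can be pictured with a small standard-library sketch. The parameter names `max_total_results`, `max_result_length`, and `max_query_budget` come from the list; everything else (the function names, the order in which the caps are applied, the extension list, and the exact set of stripped code points) is an assumption made for illustration and may not match what webshift does internally.

```rust
/// Truncates `text` to at most `max_chars` characters without splitting a code point.
pub fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}

/// Applies three caps in an assumed order: result count, per-page length, total budget.
pub fn apply_caps(
    pages: &[&str],
    max_total_results: usize,
    max_result_length: usize,
    max_query_budget: usize,
) -> Vec<String> {
    let mut remaining = max_query_budget;
    let mut kept = Vec::new();
    for page in pages.iter().take(max_total_results) {
        if remaining == 0 {
            break; // Budget exhausted: later pages are dropped entirely.
        }
        // A page may use its own cap or whatever budget is left, whichever is smaller.
        let capped = truncate_chars(page, max_result_length.min(remaining));
        remaining -= capped.chars().count();
        kept.push(capped.to_string());
    }
    kept
}

/// Returns true when the URL path ends in an extension from a hypothetical binary list.
/// Being a pure string check, it can run before any request is made.
pub fn has_binary_extension(url: &str) -> bool {
    const BINARY_EXTENSIONS: [&str; 6] = ["pdf", "zip", "png", "jpg", "mp4", "exe"];
    // Drop the query string and fragment before looking at the extension.
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);
    let last_segment = path.rsplit('/').next().unwrap_or(path);
    match last_segment.rsplit_once('.') {
        Some((_, ext)) => BINARY_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()),
        None => false,
    }
}

/// Removes Unicode bidirectional control characters, which can visually reorder text.
pub fn strip_bidi_controls(text: &str) -> String {
    text.chars()
        .filter(|c| {
            !matches!(
                *c,
                '\u{202A}'..='\u{202E}' | '\u{2066}'..='\u{2069}' | '\u{200E}' | '\u{200F}' | '\u{061C}'
            )
        })
        .collect()
}
```

Counting characters rather than bytes keeps the caps aligned with the "character" wording in the list and avoids cutting a multi-byte character in half.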
Re-exports§
pub use config::Config;
Modules§
- backends (requires the backends feature)
- Search backend trait and implementations.
- config
- Configuration system: CLI args > env vars > webshift.toml > defaults.
- llm (requires the llm feature)
- LLM integration: OpenAI-compatible client, query expansion, summarization.
- scraper
- HTML fetching and cleaning pipeline.
- utils
- Utility modules: URL handling, reranking.
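The precedence stated for the config module (CLI args > env vars > webshift.toml > defaults) amounts to taking the first layer that supplies a value. A minimal sketch of that idea, assuming each layer has already been parsed into an `Option`; `resolve_setting` is a hypothetical helper for illustration and not a function from the config module:

```rust
/// Resolves one setting by checking layers from highest to lowest priority.
/// Each argument is a hypothetical, already-parsed value from that layer:
/// CLI argument, environment variable, config file entry, then the built-in default.
pub fn resolve_setting<T>(cli: Option<T>, env: Option<T>, file: Option<T>, default: T) -> T {
    // `Option::or` keeps the first `Some`, so a higher layer always wins.
    cli.or(env).or(file).unwrap_or(default)
}
```

For instance, `resolve_setting(None, Some(4000), Some(6000), 8000)` yields 4000, because the environment layer outranks the file layer when no CLI argument is given.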
Structs§
- CleanResult
- Result of cleaning raw HTML into LLM-ready plain text.
- FetchResult
- Result of fetching and cleaning a single page.
- QueryResult
- Result of a full search query pipeline.
- SnippetEntry
- A snippet-only entry from the oversampling reserve pool.
- Source
- A single source in a query result.
- Stats
- Statistics for a query execution.
Enums§
- WebshiftError
- Top-level error type for the webshift library.
Functions§
- clean
- Clean raw HTML into LLM-ready plain text.
- fetch
- Fetch and clean a single web page.
- query (requires the backends feature)
- Execute a full search query pipeline.
- query_with_options (requires the backends feature)
- Full query pipeline with optional overrides.