WebShift
What is WebShift
WebShift is a Rust library and MCP server that shifts noisy web pages into clean, right-sized text for LLM consumption.
Raw HTML is mostly junk: scripts, ads, navigation menus, cookie banners, tracking pixels. Feeding it directly to an LLM floods the context window with tens of thousands of useless tokens and leaves no room for reasoning. WebShift strips all that noise, sterilizes the text, and enforces strict size budgets so the model receives only the content that matters.
What you get
Depending on the features you enable, WebShift can be three things:
| Use case | Crate | Feature flags | What it does |
|---|---|---|---|
| HTML denoiser | `webshift` | `default-features = false` | `clean()`: pure Rust HTML-to-text pipeline. Strips noise elements, sterilizes Unicode/BiDi, collapses whitespace. Zero network, zero config. Drop into any Rust project that processes web content for LLMs. |
| Web content client | `webshift` | default or `features = ["llm"]` | `fetch()` + `query()`: streaming HTTP fetcher with size caps, 8 search backends, BM25 reranking, optional LLM query expansion and summarization. Full pipeline from search query to structured results. |
| MCP server | `webshift-mcp` | all features | Native binary (`mcp-webshift`) that exposes `webshift_query`, `webshift_fetch`, and `webshift_onboarding` over MCP stdio. Single static binary, zero runtime dependencies. |
When to use WebShift
- You're building an AI agent that needs web search and you want clean, budget-controlled text — not raw HTML.
- You're processing web pages in a Rust pipeline and need a reliable HTML-to-text cleaner that strips noise without losing real content.
- You want an MCP web search server that works as a single binary — no Python, no pip, no venv, no Docker (unless you want it).
- You need hard guarantees on output size: per-page caps, total budget caps, streaming download limits.
When NOT to use WebShift
- You need a headless browser that renders JavaScript-heavy SPAs. WebShift parses static HTML — it doesn't execute JS.
- You need to preserve the visual layout or formatting of a page (tables, CSS grids, positioning). WebShift extracts text, not structure.
- You're building a web scraper that needs to navigate across pages, fill forms, or handle authentication flows.
How it works
Question
|
+- (optional) LLM query expansion -> multiple search variants
|
+- Search via backend (SearXNG, Brave, Tavily, Exa, SerpAPI, Google, Bing, HTTP)
|
+- Deduplicate + filter binary URLs
|
+- Streaming fetch with per-page size cap
|
+- HTML cleaning -> plain text (noise elements, scripts, nav removed)
|
+- Unicode/BiDi sterilization
|
+- BM25 deterministic reranking
| +- (optional) LLM-assisted tier-2 reranking
|
+- Budget-aware truncation across all sources
|
+- (optional) LLM Markdown summary with inline citations
|
+- Structured JSON output
For a detailed explanation of each pipeline stage, BM25 parameters, adaptive budget allocation, and real compression metrics see Under the Hood. For the full configuration reference (TOML, env vars, CLI args) see Configuration. For ready-to-use examples see Use Cases.
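To make the "BM25 deterministic reranking" stage more concrete, here is a minimal, standard-library-only sketch of Okapi BM25 scoring. It is not WebShift's implementation: the function names (`tokenize`, `bm25_scores`, `rerank`), the tokenizer, and the parameter values `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration (they are common textbook defaults).

```rust
/// Lowercase the text and split it on non-alphanumeric characters.
/// (Hypothetical tokenizer; real pipelines often add stemming or stop words.)
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Score every document against the query with Okapi BM25.
/// `k1` controls term-frequency saturation, `b` controls length normalization.
fn bm25_scores(query: &str, docs: &[&str], k1: f64, b: f64) -> Vec<f64> {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    if tokenized.is_empty() {
        return Vec::new();
    }
    let n = tokenized.len() as f64;
    let avg_len = tokenized.iter().map(|d| d.len()).sum::<usize>() as f64 / n;

    // Each distinct query term contributes once.
    let mut terms = tokenize(query);
    terms.sort();
    terms.dedup();

    // Inverse document frequency per query term (the "+ 1.0" keeps it positive).
    let idf: Vec<f64> = terms
        .iter()
        .map(|term| {
            let n_t = tokenized.iter().filter(|d| d.contains(term)).count() as f64;
            ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln()
        })
        .collect();

    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            terms
                .iter()
                .zip(&idf)
                .map(|(term, idf_t)| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    if tf == 0.0 {
                        return 0.0; // term absent: no contribution
                    }
                    idf_t * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}

/// Return document indices ordered by descending score.
/// The sort is stable, so ties keep their original order and the
/// result is deterministic for a given input.
fn rerank(query: &str, docs: &[&str]) -> Vec<usize> {
    let scores = bm25_scores(query, docs, 1.2, 0.75);
    let mut order: Vec<usize> = (0..docs.len()).collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    order
}
```

Calling `rerank("rust web", &pages)` returns page indices from most to least relevant. Because the score depends only on token counts, the same input always yields the same order, which is the property a deterministic first-tier reranker needs before any optional LLM pass.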
Installation
Binary (MCP server)
The binary is called mcp-webshift.
From source
As a library
# Full pipeline (search + fetch + clean + rerank)
webshift = "0.2"
# Cleaner + fetcher only (no search backends)
webshift = { version = "0.2", default-features = false }
# Everything including LLM features
webshift = { version = "0.2", features = ["llm"] }
Quick start
1. Set up a search backend
The easiest option is SearXNG — free, self-hosted, no API key:
No Docker? Use a cloud backend — see Search backends.
2. Configure your MCP client
That's it. The agent now has webshift_query, webshift_fetch, and webshift_onboarding.
For client-specific setup see docs/integrations/.
MCP tools
| Tool | Description |
|---|---|
| `webshift_query` | Full search pipeline: search + fetch + clean + rerank + (optional) summarize |
| `webshift_fetch` | Single page fetch and clean |
| `webshift_onboarding` | Returns a JSON guide for the agent (budgets, backends, tips) |
webshift_query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `queries` | string or list | required | Search query or list of queries |
| `num_results` | integer | 5 | Results per query |
| `lang` | string | none | Language filter (e.g. "en") |
| `backend` | string | config default | Override search backend |
Configuration
Resolution order (highest priority first):
- CLI args: `--default-backend`, `--brave-api-key`, etc.
- Environment variables: `WEBSHIFT_*` prefix
- Config file: `webshift.toml` (current dir, then `~/webshift.toml`)
- Built-in defaults
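The resolution order above amounts to a layered fallback: the first layer that sets a value wins. A minimal sketch of that idea follows. The `Layer` and `Resolved` types and the `resolve` function are hypothetical and not part of WebShift's API; the field names mirror settings mentioned in this README, and the fallback values (`"searxng"`, `32000`) are taken from the example config and only assumed to be the built-in defaults.

```rust
/// One configuration layer. `None` means "not set at this level".
/// (Hypothetical type for illustration only.)
#[derive(Default, Clone)]
struct Layer {
    default_backend: Option<String>,
    max_query_budget: Option<usize>,
}

/// Fully resolved settings after merging all layers.
struct Resolved {
    default_backend: String,
    max_query_budget: usize,
}

/// Merge layers in priority order: CLI args, then environment variables,
/// then the config file, then built-in defaults (assumed values).
fn resolve(cli: &Layer, env: &Layer, file: &Layer) -> Resolved {
    Resolved {
        default_backend: cli
            .default_backend
            .clone()
            .or_else(|| env.default_backend.clone())
            .or_else(|| file.default_backend.clone())
            .unwrap_or_else(|| "searxng".to_string()),
        max_query_budget: cli
            .max_query_budget
            .or(env.max_query_budget)
            .or(file.max_query_budget)
            .unwrap_or(32_000),
    }
}
```

Resolving each field independently means a CLI flag can override one setting while the rest still come from the environment or `webshift.toml`.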
Config file
[]
= 32000 # total char budget across all sources
= 8000 # per-page char cap
= 20 # hard cap on results per call
= 1 # streaming cap per page (MB)
= 8 # seconds
= 5
= 2
= "auto" # "auto" | "on" | "off" — budget allocation mode
[]
= "searxng"
[]
= "http://localhost:4000"
[]
= "BSA-..."
[]
= "tvly-..."
[]
= "..."
[]
= "..."
= "google" # google | bing | duckduckgo | yandex
[]
= "..."
= "..." # Custom Search Engine ID
[]
= "..."
= "en-US"
[]
= "https://my-search.example.com/api/search"
= "q"
= "limit"
= "data.items" # dot-path to results array in JSON response
= "title"
= "link"
= "description"
[]
= "Bearer my-token"
[]
= false
= "http://localhost:11434/v1" # OpenAI-compatible
= ""
= "gemma3:27b"
= 60
= true
= true
= false
For every setting with all three config methods (TOML, env vars, CLI args)
and plain-language descriptions, see the full Configuration Reference.
Ready-to-use config examples are in Use Cases and examples/.
Key environment variables
WEBSHIFT_DEFAULT_BACKEND=searxng
WEBSHIFT_SEARXNG_URL=http://localhost:4000
WEBSHIFT_BRAVE_API_KEY=BSA-xxx
WEBSHIFT_GOOGLE_API_KEY=xxx
WEBSHIFT_GOOGLE_CX=xxx
WEBSHIFT_BING_API_KEY=xxx
WEBSHIFT_LLM_ENABLED=true
WEBSHIFT_LLM_BASE_URL=http://localhost:11434/v1
WEBSHIFT_LLM_MODEL=gemma3:27b
Search backends
| Backend | Auth | Notes |
|---|---|---|
| SearXNG | none | Self-hosted, free. Default: http://localhost:4000 |
| Brave | API key | Free tier. brave.com/search/api |
| Tavily | API key | AI-oriented. tavily.com |
| Exa | API key | Neural search. exa.ai |
| SerpAPI | API key | Multi-engine proxy (Google, Bing, DDG...). serpapi.com |
| Google | API key + CX | Custom Search. Free: 100 req/day. programmablesearchengine.google.com |
| Bing | API key | Web Search API. Free: 1,000 req/month. Microsoft Azure |
| HTTP | configurable | Generic REST backend — no code required, TOML-only config |
LLM features (optional)
All opt-in — disabled by default, no data leaves your machine unless enabled.
| Feature | What it does |
|---|---|
| Query expansion | Single query -> N complementary search variants |
| Summarization | Markdown report with inline [1] [2] citations |
| LLM reranking | Tier-2 reranking on top of deterministic BM25 |
Cross-language normalization (bonus): when BM25 reranking surfaces pages in foreign languages (e.g. Chinese, Japanese, Arabic), the LLM summarizer still produces the final report in the prompt language. The agent receives clean, readable output regardless of the language mix in the source pages.
Works with any OpenAI-compatible API (OpenAI, Ollama, vLLM, LM Studio, etc.):
[]
= true
= "http://localhost:11434/v1"
= "gemma3:27b"
Anti-flooding protections
Always active — the core value proposition:
| Protection | Description |
|---|---|
| `max_download_mb` | Streaming cap: never buffers the full response |
| `max_result_length` | Hard cap on characters per cleaned page |
| `max_query_budget` | Total character budget across all sources |
| `max_total_results` | Hard cap on results per call |
| Binary filter | .pdf, .zip, .exe, etc. filtered before any network request |
| Unicode sterilization | BiDi control chars, zero-width chars removed |
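The protections in this table can be sketched with the Rust standard library alone. The code below is an illustration under stated assumptions, not WebShift's code: every function name is hypothetical, the extension list contains only the three examples named above, the set of stripped code points is one plausible choice, and the budget logic is a simple greedy pass rather than the adaptive allocation the project describes.

```rust
use std::io::Read;

/// Extensions treated as binary (illustrative subset only).
const BINARY_EXTENSIONS: [&str; 3] = [".pdf", ".zip", ".exe"];

/// Decide from the URL alone, before any network request is made.
fn is_binary_url(url: &str) -> bool {
    // Ignore the query string and fragment, then compare case-insensitively.
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url)
        .to_lowercase();
    BINARY_EXTENSIONS.iter().any(|ext| path.ends_with(ext))
}

/// Read at most `max_bytes` from a stream; the rest is never buffered.
fn read_capped<R: Read>(reader: R, max_bytes: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.take(max_bytes).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Drop BiDi control characters and zero-width characters.
fn sterilize(text: &str) -> String {
    text.chars()
        .filter(|&c| {
            !matches!(
                c,
                '\u{200B}'..='\u{200F}'   // zero-width space/joiners, LRM, RLM
                | '\u{202A}'..='\u{202E}' // BiDi embeddings and overrides
                | '\u{2066}'..='\u{2069}' // BiDi isolates
                | '\u{2060}'              // word joiner
                | '\u{FEFF}'              // zero-width no-break space (BOM)
            )
        })
        .collect()
}

/// Keep at most `max_chars` characters (not bytes), so a multi-byte
/// UTF-8 sequence is never cut in half.
fn cap_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}

/// Apply the per-page cap, then spend the total budget in rank order.
/// Pages that no longer fit in the budget are dropped.
fn apply_budgets(pages: &[&str], max_result_length: usize, max_query_budget: usize) -> Vec<String> {
    let mut remaining = max_query_budget;
    let mut out = Vec::new();
    for page in pages {
        if remaining == 0 {
            break;
        }
        let kept = cap_chars(page, max_result_length.min(remaining));
        remaining -= kept.chars().count();
        out.push(kept.to_string());
    }
    out
}
```

The caps compose: `read_capped` bounds what is downloaded, `cap_chars` bounds each cleaned page, and `apply_budgets` bounds the total handed to the model, so the output size has a hard upper limit regardless of what the remote server returns.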
Library usage
use ;
// Clean raw HTML
let result = clean;
// Fetch and clean a single page
let config = default;
let page = fetch.await?;
// Full search pipeline
let results = query.await?;
for source in &results.sources
Feature flags
| Feature | Default | Enables |
|---|---|---|
| `backends` | on | All search backends + query pipeline |
| `llm` | off | LLM client, expander, summarizer, LLM reranking |
Integrations
| Platform | Guide |
|---|---|
| Claude Desktop, Claude Code, Zed, Cursor, Windsurf, VS Code | IDE Integration |
| Gemini CLI, Claude CLI, custom agents | Agent Integration |
Alpha Status
WebShift is in alpha. Core functionality is stable and the server is used daily, but the API surface may still change before 1.0.
Feedback is very welcome. If something doesn't work as expected, behaves oddly, or you have a use case that isn't covered, please open an issue on GitHub.
Bug reports, configuration questions, and feature requests all help shape the roadmap.
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for detailed guidelines on:
- Development setup and workflow
- Code style and conventions
- Testing requirements
- Documentation standards
- Pull request process
License
MIT License — see LICENSE for details.
Links
- GitHub Repository — Source code and issues
- MCP Protocol — Model Context Protocol specification
Need help? Check the documentation or open an issue on GitHub.