Essence
Website | Docs | crates.io | GitHub
A fast, open-source web retrieval engine built in Rust. Fetches pages via lightweight HTTP with automatic fallback to headless Chromium for JavaScript-heavy sites. Returns clean, LLM-ready Markdown.
91% quality win rate against Firecrawl across 35 real-world URLs, judged by LLM evaluation. 1.5x faster on average.
Install
Why Essence?
- Two-tier rendering -- lightweight HTTP fetch for most pages, automatic Chromium fallback for SPAs and JS-heavy sites
- LLM-optimized output -- clean Markdown with noise removal, heading hierarchy, code preservation, and structured metadata
- 4 REST endpoints + MCP server -- scrape, crawl, map, search -- usable by humans and AI agents alike
- Self-hosted & open-source -- no API keys, no rate limits, no vendor lock-in
- Respectful crawling -- robots.txt compliance, per-domain rate limits, configurable politeness
- Fast -- median response under 1 second for HTTP-rendered pages
Quick Start
Prerequisites
# macOS
# Ubuntu/Debian
# Or skip Chromium -- the HTTP engine handles most pages without a browser
Build and Run
# Server starts at http://localhost:8080
Docker
# Service available at http://localhost:8080
Try It
# Scrape a page to Markdown
# Discover all URLs on a site
# Crawl a site (follow links, respect robots.txt)
# Search the web
API Reference
All API endpoints accept JSON POST requests under /api/v1/. The health check is the exception: it is a GET at /health.
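As a rough illustration of the request shape, here is a minimal Python sketch that builds a JSON POST for any of the endpoints using only the standard library. The base URL is the default from Quick Start. The helper names `build_request` and `call` are hypothetical, and the assumption that every response body is JSON is not confirmed by this README.

```python
import json
import urllib.request

# Default address from the Quick Start section; adjust if PORT is changed.
BASE_URL = "http://localhost:8080"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /api/v1/<endpoint>."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(endpoint: str, payload: dict, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the body, assuming the server replies with JSON."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `call("scrape", {"url": "https://example.com"})` would send the simplest possible scrape request.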
POST /api/v1/scrape
Single-page capture with automatic engine selection.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | URL to scrape |
| `formats` | string[] | `["markdown"]` | Output formats: `"markdown"`, `"html"`, `"links"`, `"metadata"` |
| `engine` | string | `"auto"` | Rendering engine: `"auto"`, `"http"`, or `"browser"` |
| `timeout` | int | `30000` | Timeout in milliseconds |
| `headers` | object | none | Custom HTTP headers |
| `onlyMainContent` | bool | `true` | Extract only the main content area |
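The fields above can be assembled into a request body along these lines. This is a sketch, not the server's own validation: the function name `scrape_payload` is hypothetical, and the client-side checks simply mirror the allowed values listed in the table.

```python
from typing import Optional

# Allowed values as listed in the request table.
VALID_FORMATS = {"markdown", "html", "links", "metadata"}
VALID_ENGINES = {"auto", "http", "browser"}


def scrape_payload(
    url: str,
    formats: Optional[list] = None,
    engine: str = "auto",
    timeout: int = 30000,
    headers: Optional[dict] = None,
    only_main_content: bool = True,
) -> dict:
    """Build a /api/v1/scrape body using the documented defaults."""
    formats = formats or ["markdown"]
    unknown = set(formats) - VALID_FORMATS
    if unknown:
        raise ValueError(f"unknown formats: {sorted(unknown)}")
    if engine not in VALID_ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    payload = {
        "url": url,
        "formats": formats,
        "engine": engine,
        "timeout": timeout,  # milliseconds
        "onlyMainContent": only_main_content,
    }
    if headers:
        payload["headers"] = headers  # optional, omitted when empty
    return payload
```

Forcing the browser engine for a known SPA would then be `scrape_payload("https://example.com", engine="browser")`.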
Response:
POST /api/v1/crawl
Multi-page traversal with dedup, robots.txt compliance, and rate limiting.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Starting URL |
| `maxDepth` | int | `2` | Maximum link depth to follow |
| `limit` | int | `100` | Maximum pages to crawl |
| `includePaths` | string[] | none | Glob patterns to include (e.g. `["/blog/*"]`) |
| `excludePaths` | string[] | none | Glob patterns to exclude (e.g. `["/admin/*"]`) |
| `allowBackwardLinks` | bool | `false` | Follow links up the URL hierarchy |
| `allowExternalLinks` | bool | `false` | Follow links to other domains |
| `detectPagination` | bool | `true` | Automatically follow paginated content |
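To make the effect of `includePaths` and `excludePaths` concrete, the sketch below filters URL paths with Python's `fnmatch`. It rests on two assumptions this README does not confirm: that an exclude match takes precedence over an include match, and that omitting `includePaths` means every path is eligible. The server's actual glob semantics may differ (note that `fnmatch`'s `*` also matches `/`).

```python
from fnmatch import fnmatchcase
from typing import Optional, Sequence


def path_allowed(
    path: str,
    include_paths: Optional[Sequence[str]] = None,
    exclude_paths: Optional[Sequence[str]] = None,
) -> bool:
    """Return True if a URL path passes the include/exclude glob filters."""
    # Assumption: excludes win over includes.
    if exclude_paths and any(fnmatchcase(path, pat) for pat in exclude_paths):
        return False
    # Assumption: no include list means nothing is filtered out.
    if not include_paths:
        return True
    return any(fnmatchcase(path, pat) for pat in include_paths)
```

Under these assumptions, a crawl with `includePaths: ["/blog/*"]` and `excludePaths: ["/admin/*"]` would visit `/blog/post-1` but skip `/about` and `/admin/users`.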
Response:
POST /api/v1/map
URL discovery via sitemaps and in-page link extraction.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | required | Site URL to discover links from |
| `includeSubdomains` | bool | `true` | Include URLs from subdomains |
| `limit` | int | `5000` | Maximum URLs to return |
| `search` | string | none | Filter URLs by search query |
| `ignoreSitemap` | bool | `false` | Skip sitemap.xml discovery |
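One plausible reading of `includeSubdomains` is a hostname suffix check, sketched below. The function `in_scope` is hypothetical and only illustrates the idea; how Essence actually compares hosts (for example, whether `www.` is treated specially) is not described here.

```python
from urllib.parse import urlsplit


def in_scope(candidate: str, site_url: str, include_subdomains: bool = True) -> bool:
    """Return True if `candidate` belongs to the site at `site_url`."""
    site_host = (urlsplit(site_url).hostname or "").lower()
    host = (urlsplit(candidate).hostname or "").lower()
    if not site_host or not host:
        return False
    if host == site_host:
        return True
    # A subdomain ends with ".<site host>"; a bare suffix match is not enough.
    return include_subdomains and host.endswith("." + site_host)
```

The leading dot in the suffix check is what keeps `notexample.com` from being mistaken for a subdomain of `example.com`.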
Response:
POST /api/v1/search
Web search via DuckDuckGo with optional scraping of result pages.
Request:
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | required | Search query |
| `limit` | int | `10` | Number of search results |
| `scrapeResults` | bool | `false` | Scrape full content of each result |
Response:
POST /api/v1/crawl/stream
Streaming variant of the crawl endpoint. Returns results via Server-Sent Events (SSE) as pages are crawled, rather than waiting for the full crawl to complete. Accepts the same parameters as /api/v1/crawl.
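A client consuming the stream needs to split the response into SSE events. The sketch below parses the standard `event:` and `data:` fields from an iterable of lines. The event name `page` and the JSON payload used in the usage example are illustrative assumptions, since the actual event schema is not documented in this README.

```python
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {"event", "data"} dicts from the lines of a text/event-stream body."""
    event, data = "message", []  # "message" is the SSE default event type
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

Feeding it `["event: page\n", 'data: {"url": "https://example.com"}\n', "\n"]` yields one event whose `data` can be passed to `json.loads`. In practice the lines would come from the decoded HTTP response body.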
GET /health
Health check endpoint. Returns 200 OK when the server is running. Used by Docker's HEALTHCHECK.
MCP Server (AI Agent Integration)
Essence includes a built-in Model Context Protocol server, allowing AI agents (Claude, Cursor, Windsurf, etc.) to use scrape, crawl, map, and search as tools directly.
The MCP endpoint is available at http://localhost:8080/mcp using the Streamable HTTP transport. Start the Essence server first, then connect your MCP client.
Available MCP Tools
| Tool | Description |
|---|---|
| `scrape` | Fetch a single page and return Markdown content. Params: `url` (required), `formats`, `engine`, `timeout_ms` |
| `crawl` | Crawl a website up to a given depth and page limit. Params: `url` (required), `max_depth`, `limit`, `include_paths`, `exclude_paths` |
| `map` | Discover URLs on a site via sitemaps and link extraction. Params: `url` (required), `search`, `limit`, `include_subdomains` |
| `search` | Search the web and optionally scrape results. Params: `query` (required), `limit`, `scrape_results` |
Note: The MCP server uses `timeout_ms` while the REST API uses `timeout`. Both accept milliseconds.
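The naming difference between the two interfaces can be bridged mechanically. The sketch below converts REST-style camelCase fields to the snake_case MCP parameter names listed in the tools table, renames `timeout` to `timeout_ms`, and drops fields the corresponding tool does not list. The helper `rest_to_mcp_args` is hypothetical and not part of Essence.

```python
import re

# Parameter names per tool, copied from the MCP tools table.
MCP_TOOL_PARAMS = {
    "scrape": {"url", "formats", "engine", "timeout_ms"},
    "crawl": {"url", "max_depth", "limit", "include_paths", "exclude_paths"},
    "map": {"url", "search", "limit", "include_subdomains"},
    "search": {"query", "limit", "scrape_results"},
}


def to_snake_case(name: str) -> str:
    """maxDepth -> max_depth, includeSubdomains -> include_subdomains."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", lambda m: "_" + m.group(1).lower(), name)


def rest_to_mcp_args(tool: str, rest_fields: dict) -> dict:
    """Translate a REST request body into arguments for the named MCP tool."""
    allowed = MCP_TOOL_PARAMS[tool]
    args = {}
    for key, value in rest_fields.items():
        name = "timeout_ms" if key == "timeout" else to_snake_case(key)
        if name in allowed:  # silently drop fields the tool does not list
            args[name] = value
    return args
```

For example, a REST scrape body containing `timeout` and `onlyMainContent` becomes MCP arguments with `timeout_ms` only, because `onlyMainContent` has no listed MCP counterpart.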
Setup with Claude Code
Add to your project's .mcp.json (or global ~/.claude/.mcp.json):
Then use Essence tools directly in Claude Code conversations -- Claude will call scrape, crawl, map, and search as needed.
Setup with Claude Desktop
Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json on macOS):
Setup with Cursor / Windsurf / Other MCP Clients
Point your MCP client at http://localhost:8080/mcp using the HTTP transport. No API key required.
Architecture
Two-Tier Rendering
Request → HTTP Engine (fast, lightweight)
↓ content density too low?
Browser Engine (Chromium CDP)
↓
Markdown Output (clean, LLM-ready)
- HTTP Engine (default) -- lightweight fetch via reqwest. Used when the HTML contains sufficient content density.
- Browser Engine (fallback) -- full Chromium automation via CDP. Activated for SPAs, anti-bot pages, and JavaScript-rendered content.
Auto-detection triggers browser fallback based on:
- Content density analysis (too little text in HTML)
- Hydration markers (`__NEXT_DATA__`, `__NUXT__`, etc.)
- Meta-refresh redirects
- Anti-fetch response headers
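The first three triggers above can be approximated with a few string checks, as in the Python sketch below. This is an illustration of the idea only, not the Rust implementation: the 200-character threshold, the regexes, and the function names are all assumptions, and the header-based trigger is left out because the README does not say which headers are inspected.

```python
import re

HYDRATION_MARKERS = ("__NEXT_DATA__", "__NUXT__")  # markers named in the list above
MIN_TEXT_CHARS = 200  # assumed threshold, not Essence's real value


def visible_text(html: str) -> str:
    """Crudely strip scripts, styles, and tags, then collapse whitespace."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return " ".join(text.split())


def needs_browser(html: str) -> bool:
    """Guess whether a fetched page should be re-rendered in a browser."""
    if any(marker in html for marker in HYDRATION_MARKERS):
        return True  # framework hydration payload present
    if re.search(r"(?i)<meta[^>]+http-equiv=[\"']?refresh", html):
        return True  # meta-refresh redirect
    return len(visible_text(html)) < MIN_TEXT_CHARS  # content density too low
```

An empty SPA shell such as `<div id="root"></div>` plus a script tag has no visible text and would be sent to the browser engine, while a server-rendered article with a few paragraphs would stay on the HTTP path.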
Module Structure
backend/src/
api/ # Endpoint handlers (scrape, crawl, map, search)
engines/ # HTTP and Browser rendering engines
format/ # Markdown and metadata output formatters
crawler/ # URL frontier, robots.txt, rate limiting, sitemaps, pagination
search/ # DuckDuckGo HTML search integration
cache/ # In-memory caching (moka)
rate_limit/ # Per-domain rate limiting (governor)
utils/ # DNS cache, URL normalization, robots.txt parsing
mcp.rs # MCP server (Model Context Protocol)
types.rs # Request/response types
error.rs # Error types
config.rs # Configuration
main.rs # Server startup with graceful shutdown
Configuration
Copy backend/.env.example to backend/.env and customize:
| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | Server port |
| `RUST_LOG` | `essence=info` | Log level (debug, info, warn, error) |
| `BROWSER_HEADLESS` | `true` | Run Chromium in headless mode |
| `BROWSER_POOL_SIZE` | `5` | Number of browser instances in pool |
| `BROWSER_TIMEOUT_MS` | `30000` | Browser page timeout |
| `ENGINE_WATERFALL_ENABLED` | `true` | Enable HTTP-to-browser fallback |
| `ENGINE_WATERFALL_DELAY_MS` | `5000` | Delay before browser fallback |
| `CRAWL_RATE_LIMIT_PER_SEC` | `2` | Crawl rate limit per domain |
| `MAX_CONCURRENT_REQUESTS` | `10` | Max concurrent crawl requests |
| `MAX_PARALLEL_SCRAPES` | `5` | Parallel scrapes for search results |
| `MAX_REQUEST_SIZE_MB` | `1` | Maximum request body size |
| `CRAWL_MAX_DURATION_SEC` | `300` | Maximum crawl duration in seconds |
| `RETRY_MAX_ATTEMPTS` | `3` | Max retry attempts for failed requests |
| `RETRY_INITIAL_DELAY_MS` | `500` | Initial retry delay |
| `RETRY_MAX_DELAY_MS` | `30000` | Maximum retry delay |
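To see how the three `RETRY_*` settings could interact, the sketch below computes a retry schedule from them. It assumes exponential backoff with doubling and treats `RETRY_MAX_ATTEMPTS` as the number of retries. Neither assumption is stated in this README, so read it as one plausible interpretation rather than the server's actual policy.

```python
import os
from typing import Mapping


def env_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Read an integer setting, falling back to the documented default."""
    raw = env.get(name)
    return int(raw) if raw not in (None, "") else default


def retry_delays_ms(env: Mapping[str, str] = os.environ) -> list:
    """Delay before each retry, assuming doubling capped at the maximum delay."""
    attempts = env_int("RETRY_MAX_ATTEMPTS", 3, env)
    initial = env_int("RETRY_INITIAL_DELAY_MS", 500, env)
    cap = env_int("RETRY_MAX_DELAY_MS", 30000, env)
    return [min(initial * 2 ** i, cap) for i in range(attempts)]
```

With the defaults this gives waits of 500, 1000, and 2000 ms. The 30000 ms cap only comes into play from the seventh retry onward.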
Testing
# Unit tests (377 tests, no network required)
# Lint
# Integration tests (requires network)
# Browser engine tests (requires Chromium)
# Compile all tests without running
Benchmarks
Essence is benchmarked head-to-head against Firecrawl across 35 real-world URLs spanning 7 categories, evaluated by an LLM judge on 5 quality dimensions.
| Metric | Essence | Target | Status |
|---|---|---|---|
| Quality Win Rate (LLM Judge) | 91.2% (31-2-1) | >= 70% | Exceeded |
| Speed Win Rate | 74% (26-9-0) | >= 50% | Exceeded |
| Success Rate | 100% | 100% | Met |
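The percentages in the table follow from the records in parentheses if each record is read as wins-losses-ties and the win rate is wins divided by the total of the three. That reading is an assumption. Under it, the quality record sums to 34 comparisons and the speed record to 35.

```python
def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate in percent, counting ties in the denominator."""
    total = wins + losses + ties
    if total == 0:
        raise ValueError("record has no comparisons")
    return 100.0 * wins / total


quality = win_rate(31, 2, 1)  # 31 / 34
speed = win_rate(26, 9, 0)    # 26 / 35
print(f"quality: {quality:.1f}%  speed: {speed:.0f}%")
```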
Per-Category Quality (LLM Judge)
| Category | Essence Wins | Win Rate |
|---|---|---|
| Structured (Wikipedia, APIs) | 5/5 | 100% |
| News (BBC, Reuters, Ars) | 5/5 | 100% |
| Reference (HN, arXiv, dev.to) | 5/5 | 100% |
| Content (essays, blogs) | 4/5 | 80% |
| Dynamic (React, Next.js, Vercel) | 4/5 | 80% |
| Docs (Rust, Python, Go, MDN) | 3/5 | 60% |
| E-Commerce (Newegg, eBay, IKEA) | 3/5 | 60% |
See the full benchmark methodology for detailed results.
Contributing
See CONTRIBUTING.md for development setup, code style, and PR process.
This repository includes a CLAUDE.md for AI-assisted development.