# Essence
A fast, open-source web retrieval engine built in Rust. Fetches pages via lightweight HTTP with automatic fallback to headless Chromium for JavaScript-heavy sites. Returns clean, LLM-ready Markdown.
**91% quality win rate** against Firecrawl across 35 real-world URLs, judged by LLM evaluation. **1.5x faster** on average.
## Install
```bash
cargo install essence-engine
```
## Why Essence?
- **Two-tier rendering** -- lightweight HTTP fetch for most pages, automatic Chromium fallback for SPAs and JS-heavy sites
- **LLM-optimized output** -- clean Markdown with noise removal, heading hierarchy, code preservation, and structured metadata
- **4 REST endpoints + MCP server** -- scrape, crawl, map, search -- usable by humans and AI agents alike
- **Self-hosted & open-source** -- no API keys, no rate limits, no vendor lock-in
- **Respectful crawling** -- robots.txt compliance, per-domain rate limits, configurable politeness
- **Fast** -- median response under 1 second for HTTP-rendered pages
## Quick Start
### Prerequisites
- [Rust](https://rustup.rs/) (stable toolchain)
- [Chromium](https://www.chromium.org/) (optional -- only needed for browser engine fallback)
```bash
# macOS
brew install chromium
# Ubuntu/Debian
sudo apt install chromium-browser
# Or skip Chromium -- the HTTP engine handles most pages without a browser
```
### Build and Run
```bash
git clone https://github.com/ruchit-p/essence.git
cd essence/backend
cp .env.example .env
cargo build --release
cargo run --release
# Server starts at http://localhost:8080
```
### Docker
```bash
docker-compose up -d
# Service available at http://localhost:8080
```
### Try It
```bash
# Scrape a page to Markdown
curl -s -X POST http://localhost:8080/api/v1/scrape \
-H "Content-Type: application/json" \
# Discover all URLs on a site
curl -s -X POST http://localhost:8080/api/v1/map \
-H "Content-Type: application/json" \
# Crawl a site (follow links, respect robots.txt)
curl -s -X POST http://localhost:8080/api/v1/crawl \
-H "Content-Type: application/json" \
# Search the web
curl -s -X POST http://localhost:8080/api/v1/search \
-H "Content-Type: application/json" \
---
## API Reference
All endpoints accept JSON POST requests at `/api/v1/`.
### POST /api/v1/scrape
Single-page capture with automatic engine selection.
**Request:**
| `url` | string | *required* | URL to scrape |
| `formats` | string[] | `["markdown"]` | Output formats: `"markdown"`, `"html"`, `"links"`, `"metadata"` |
| `engine` | string | `"auto"` | Rendering engine: `"auto"`, `"http"`, or `"browser"` |
| `timeout` | int | `30000` | Timeout in milliseconds |
| `headers` | object | none | Custom HTTP headers |
| `onlyMainContent` | bool | `true` | Extract only the main content area |
**Response:**
```json
{
"success": true,
"data": {
"markdown": "# Page Title\n\nPage content in clean Markdown...",
"metadata": {
"title": "Page Title",
"description": "Page description",
"language": "en",
"url": "https://example.com",
"statusCode": 200,
"wordCount": 1234,
"readingTime": 5
},
"links": ["https://example.com/about", "https://example.com/contact"],
"images": ["https://example.com/hero.jpg"]
}
}
```
### POST /api/v1/crawl
Multi-page traversal with dedup, robots.txt compliance, and rate limiting.
**Request:**
| `url` | string | *required* | Starting URL |
| `maxDepth` | int | `2` | Maximum link depth to follow |
| `limit` | int | `100` | Maximum pages to crawl |
| `includePaths` | string[] | none | Glob patterns to include (e.g. `["/blog/*"]`) |
| `excludePaths` | string[] | none | Glob patterns to exclude (e.g. `["/admin/*"]`) |
| `allowBackwardLinks` | bool | `false` | Follow links up the URL hierarchy |
| `allowExternalLinks` | bool | `false` | Follow links to other domains |
| `detectPagination` | bool | `true` | Automatically follow paginated content |
**Response:**
```json
{
"success": true,
"data": [
{
"markdown": "# Home\n\nWelcome...",
"metadata": { "title": "Home", "url": "https://example.com", "statusCode": 200 }
},
{
"markdown": "# About\n\nAbout us...",
"metadata": { "title": "About", "url": "https://example.com/about", "statusCode": 200 }
}
]
}
```
### POST /api/v1/map
URL discovery via sitemaps and in-page link extraction.
**Request:**
| `url` | string | *required* | Site URL to discover links from |
| `includeSubdomains` | bool | `true` | Include URLs from subdomains |
| `limit` | int | `5000` | Maximum URLs to return |
| `search` | string | none | Filter URLs by search query |
| `ignoreSitemap` | bool | `false` | Skip sitemap.xml discovery |
**Response:**
```json
{
"success": true,
"links": [
"https://example.com/",
"https://example.com/about",
"https://example.com/blog",
"https://example.com/blog/post-1"
]
}
```
### POST /api/v1/search
Web search via DuckDuckGo with optional scraping of result pages.
**Request:**
| `query` | string | *required* | Search query |
| `limit` | int | `10` | Number of search results |
| `scrapeResults` | bool | `false` | Scrape full content of each result |
**Response:**
```json
{
"success": true,
"data": [
{
"title": "Web Scraping in Rust",
"url": "https://example.com/rust-scraping",
"snippet": "A guide to web scraping with Rust..."
}
]
}
```
### POST /api/v1/crawl/stream
Streaming variant of the crawl endpoint. Returns results via Server-Sent Events (SSE) as pages are crawled, rather than waiting for the full crawl to complete. Accepts the same parameters as `/api/v1/crawl`.
### GET /health
Health check endpoint. Returns `200 OK` when the server is running. Used by Docker's `HEALTHCHECK`.
---
## MCP Server (AI Agent Integration)
Essence includes a built-in [Model Context Protocol](https://modelcontextprotocol.io/) server, allowing AI agents (Claude, Cursor, Windsurf, etc.) to use scrape, crawl, map, and search as tools directly.
The MCP endpoint is available at `http://localhost:8080/mcp` using the Streamable HTTP transport. Start the Essence server first, then connect your MCP client.
### Available MCP Tools
| `scrape` | Fetch a single page and return Markdown content. Params: `url` (required), `formats`, `engine`, `timeout_ms` |
| `crawl` | Crawl a website up to a given depth and page limit. Params: `url` (required), `max_depth`, `limit`, `include_paths`, `exclude_paths` |
| `map` | Discover URLs on a site via sitemaps and link extraction. Params: `url` (required), `search`, `limit`, `include_subdomains` |
| `search` | Search the web and optionally scrape results. Params: `query` (required), `limit`, `scrape_results` |
> **Note:** The MCP server uses `timeout_ms` while the REST API uses `timeout`. Both accept milliseconds.
### Setup with Claude Code
Add to your project's `.mcp.json` (or global `~/.claude/.mcp.json`):
```json
{
"mcpServers": {
"essence": {
"type": "http",
"url": "http://localhost:8080/mcp"
}
}
}
```
Then use Essence tools directly in Claude Code conversations -- Claude will call `scrape`, `crawl`, `map`, and `search` as needed.
### Setup with Claude Desktop
Add to your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):
```json
{
"mcpServers": {
"essence": {
"type": "http",
"url": "http://localhost:8080/mcp"
}
}
}
```
### Setup with Cursor / Windsurf / Other MCP Clients
Point your MCP client at `http://localhost:8080/mcp` using the HTTP transport. No API key required.
---
## Architecture
### Two-Tier Rendering
```
Request → HTTP Engine (fast, lightweight)
↓ content density too low?
Browser Engine (Chromium CDP)
↓
Markdown Output (clean, LLM-ready)
```
1. **HTTP Engine** (default) -- lightweight fetch via reqwest. Used when the HTML contains sufficient content density.
2. **Browser Engine** (fallback) -- full Chromium automation via CDP. Activated for SPAs, anti-bot pages, and JavaScript-rendered content.
Auto-detection triggers browser fallback based on:
- Content density analysis (too little text in HTML)
- Hydration markers (`__NEXT_DATA__`, `__NUXT__`, etc.)
- Meta-refresh redirects
- Anti-fetch response headers
### Module Structure
```
backend/src/
api/ # Endpoint handlers (scrape, crawl, map, search)
engines/ # HTTP and Browser rendering engines
format/ # Markdown and metadata output formatters
crawler/ # URL frontier, robots.txt, rate limiting, sitemaps, pagination
search/ # DuckDuckGo HTML search integration
cache/ # In-memory caching (moka)
rate_limit/ # Per-domain rate limiting (governor)
utils/ # DNS cache, URL normalization, robots.txt parsing
mcp.rs # MCP server (Model Context Protocol)
types.rs # Request/response types
error.rs # Error types
config.rs # Configuration
main.rs # Server startup with graceful shutdown
```
---
## Configuration
Copy `backend/.env.example` to `backend/.env` and customize:
| `PORT` | `8080` | Server port |
| `RUST_LOG` | `essence=info` | Log level (`debug`, `info`, `warn`, `error`) |
| `BROWSER_HEADLESS` | `true` | Run Chromium in headless mode |
| `BROWSER_POOL_SIZE` | `5` | Number of browser instances in pool |
| `BROWSER_TIMEOUT_MS` | `30000` | Browser page timeout |
| `ENGINE_WATERFALL_ENABLED` | `true` | Enable HTTP-to-browser fallback |
| `ENGINE_WATERFALL_DELAY_MS` | `5000` | Delay before browser fallback |
| `CRAWL_RATE_LIMIT_PER_SEC` | `2` | Crawl rate limit per domain |
| `MAX_CONCURRENT_REQUESTS` | `10` | Max concurrent crawl requests |
| `MAX_PARALLEL_SCRAPES` | `5` | Parallel scrapes for search results |
| `MAX_REQUEST_SIZE_MB` | `1` | Maximum request body size |
| `CRAWL_MAX_DURATION_SEC` | `300` | Maximum crawl duration in seconds |
| `RETRY_MAX_ATTEMPTS` | `3` | Max retry attempts for failed requests |
| `RETRY_INITIAL_DELAY_MS` | `500` | Initial retry delay |
| `RETRY_MAX_DELAY_MS` | `30000` | Maximum retry delay |
---
## Testing
```bash
cd backend
# Unit tests (377 tests, no network required)
cargo test --lib
# Lint
cargo clippy -- -D warnings
# Integration tests (requires network)
cargo test --test integration -- --ignored
# Browser engine tests (requires Chromium)
cargo test --test browser_engine_tests
# Compile all tests without running
cargo test --no-run
```
---
## Benchmarks
Essence is benchmarked head-to-head against [Firecrawl](https://firecrawl.dev) across 35 real-world URLs spanning 7 categories, evaluated by an LLM judge on 5 quality dimensions.
| **Quality Win Rate** (LLM Judge) | **91.2%** (31-2-1) | >= 70% | Exceeded |
| **Speed Win Rate** | **74%** (26-9-0) | >= 50% | Exceeded |
| Success Rate | 100% | 100% | Met |
### Per-Category Quality (LLM Judge)
| Structured (Wikipedia, APIs) | 5/5 | **100%** |
| News (BBC, Reuters, Ars) | 5/5 | **100%** |
| Reference (HN, arXiv, dev.to) | 5/5 | **100%** |
| Content (essays, blogs) | 4/5 | **80%** |
| Dynamic (React, Next.js, Vercel) | 4/5 | **80%** |
| Docs (Rust, Python, Go, MDN) | 3/5 | **60%** |
| E-Commerce (Newegg, eBay, IKEA) | 3/5 | **60%** |
See the [full benchmark methodology](https://github.com/ruchit-p/essence/blob/master/docs/BENCHMARKS.md) for detailed results.
---
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, code style, and PR process.
This repository includes a `CLAUDE.md` for AI-assisted development.
## License
[MIT](LICENSE)