essence-engine 0.2.0

A fast web retrieval engine with HTTP-to-browser fallback, producing LLM-ready Markdown
# Essence

[Website](https://essence.foundation) | [Docs](https://essence.foundation/docs) | [crates.io](https://crates.io/crates/essence-engine) | [GitHub](https://github.com/ruchit-p/essence)

A fast, open-source web retrieval engine built in Rust. Fetches pages via lightweight HTTP with automatic fallback to headless Chromium for JavaScript-heavy sites. Returns clean, LLM-ready Markdown.

**91% quality win rate** against Firecrawl across 35 real-world URLs, judged by LLM evaluation. **1.5x faster** on average.

## Install

```bash
cargo install essence-engine
```

## Why Essence?

- **Two-tier rendering** -- lightweight HTTP fetch for most pages, automatic Chromium fallback for SPAs and JS-heavy sites
- **LLM-optimized output** -- clean Markdown with noise removal, heading hierarchy, code preservation, and structured metadata
- **4 REST endpoints + MCP server** -- scrape, crawl, map, search -- usable by humans and AI agents alike
- **Self-hosted & open-source** -- no API keys, no rate limits, no vendor lock-in
- **Respectful crawling** -- robots.txt compliance, per-domain rate limits, configurable politeness
- **Fast** -- median response under 1 second for HTTP-rendered pages

## Quick Start

### Prerequisites

- [Rust](https://rustup.rs/) (stable toolchain)
- [Chromium](https://www.chromium.org/) (optional -- only needed for browser engine fallback)

```bash
# macOS
brew install chromium

# Ubuntu/Debian
sudo apt install chromium-browser

# Or skip Chromium -- the HTTP engine handles most pages without a browser
```

### Build and Run

```bash
git clone https://github.com/ruchit-p/essence.git
cd essence/backend
cp .env.example .env
cargo build --release
cargo run --release
# Server starts at http://localhost:8080
```

### Docker

```bash
docker-compose up -d
# Service available at http://localhost:8080
```

### Try It

```bash
# Scrape a page to Markdown
curl -s -X POST http://localhost:8080/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}' | jq .

# Discover all URLs on a site
curl -s -X POST http://localhost:8080/api/v1/map \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}' | jq .

# Crawl a site (follow links, respect robots.txt)
curl -s -X POST http://localhost:8080/api/v1/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "maxDepth": 2, "limit": 10}' | jq .

# Search the web
curl -s -X POST http://localhost:8080/api/v1/search \
  -H "Content-Type: application/json" \
  -d '{"query": "rust web scraping", "limit": 5}' | jq .
```
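
The same calls can be made from a script. Below is a minimal Python sketch using only the standard library. The helper names `build_request` and `post_json` are illustrative and not part of Essence, and the sketch assumes the server is listening on the default `http://localhost:8080`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # default address from the Quick Start above


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an /api/v1/ endpoint without sending it."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(endpoint: str, payload: dict, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_request(endpoint, payload)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `post_json("scrape", {"url": "https://example.com", "formats": ["markdown"]})` mirrors the first `curl` command above.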

---

## API Reference

The scrape, crawl, map, and search endpoints accept JSON POST requests under `/api/v1/`. The health check is a plain `GET` at `/health`.

### POST /api/v1/scrape

Single-page capture with automatic engine selection.

**Request:**

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | *required* | URL to scrape |
| `formats` | string[] | `["markdown"]` | Output formats: `"markdown"`, `"html"`, `"links"`, `"metadata"` |
| `engine` | string | `"auto"` | Rendering engine: `"auto"`, `"http"`, or `"browser"` |
| `timeout` | int | `30000` | Timeout in milliseconds |
| `headers` | object | none | Custom HTTP headers |
| `onlyMainContent` | bool | `true` | Extract only the main content area |

**Response:**

```json
{
  "success": true,
  "data": {
    "markdown": "# Page Title\n\nPage content in clean Markdown...",
    "metadata": {
      "title": "Page Title",
      "description": "Page description",
      "language": "en",
      "url": "https://example.com",
      "statusCode": 200,
      "wordCount": 1234,
      "readingTime": 5
    },
    "links": ["https://example.com/about", "https://example.com/contact"],
    "images": ["https://example.com/hero.jpg"]
  }
}
```
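
A request body can be assembled and checked on the client before it is sent. The sketch below is one way to do that in Python. The function names are hypothetical, the defaults are copied from the request table, and the client-side validation is this example's own choice; the README does not say how the server treats unlisted values or what a failed response looks like beyond the `success` flag.

```python
VALID_FORMATS = {"markdown", "html", "links", "metadata"}
VALID_ENGINES = {"auto", "http", "browser"}


def scrape_payload(url, formats=("markdown",), engine="auto", timeout=30000,
                   headers=None, only_main_content=True):
    """Build a /api/v1/scrape body, rejecting values the request table does not list."""
    unknown = set(formats) - VALID_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    if engine not in VALID_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    payload = {
        "url": url,
        "formats": list(formats),
        "engine": engine,
        "timeout": timeout,  # milliseconds
        "onlyMainContent": only_main_content,
    }
    if headers:
        payload["headers"] = dict(headers)  # custom HTTP headers, only when given
    return payload


def read_markdown(response):
    """Return the Markdown text from a response shaped like the example above."""
    if not response.get("success"):
        # Assumption: a failed scrape reports success=false.
        raise RuntimeError("scrape did not succeed")
    return response["data"]["markdown"]
```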

### POST /api/v1/crawl

Multi-page traversal with dedup, robots.txt compliance, and rate limiting.

**Request:**

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | *required* | Starting URL |
| `maxDepth` | int | `2` | Maximum link depth to follow |
| `limit` | int | `100` | Maximum pages to crawl |
| `includePaths` | string[] | none | Glob patterns to include (e.g. `["/blog/*"]`) |
| `excludePaths` | string[] | none | Glob patterns to exclude (e.g. `["/admin/*"]`) |
| `allowBackwardLinks` | bool | `false` | Follow links up the URL hierarchy |
| `allowExternalLinks` | bool | `false` | Follow links to other domains |
| `detectPagination` | bool | `true` | Automatically follow paginated content |

**Response:**

```json
{
  "success": true,
  "data": [
    {
      "markdown": "# Home\n\nWelcome...",
      "metadata": { "title": "Home", "url": "https://example.com", "statusCode": 200 }
    },
    {
      "markdown": "# About\n\nAbout us...",
      "metadata": { "title": "About", "url": "https://example.com/about", "statusCode": 200 }
    }
  ]
}
```
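
To get a feel for how `includePaths` and `excludePaths` patterns select URLs, the sketch below applies shell-style globs to a URL's path with Python's `fnmatch`. This is an illustration only: Essence's own matcher may use different glob semantics, and letting excludes take precedence over includes is an assumption of this example.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def path_allowed(url, include_paths=None, exclude_paths=None):
    """Return True if the URL's path passes the include/exclude glob filters."""
    path = urlparse(url).path or "/"
    # Assumption: an exclude match always wins over an include match.
    if exclude_paths and any(fnmatchcase(path, p) for p in exclude_paths):
        return False
    if include_paths:
        return any(fnmatchcase(path, p) for p in include_paths)
    return True  # no include list means every non-excluded path is allowed
```

Under these semantics `/blog/*` matches `/blog/post-1` but not `/blog` itself, so a pattern such as `/blog*` would be needed to cover both.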

### POST /api/v1/map

URL discovery via sitemaps and in-page link extraction.

**Request:**

| Field | Type | Default | Description |
|---|---|---|---|
| `url` | string | *required* | Site URL to discover links from |
| `includeSubdomains` | bool | `true` | Include URLs from subdomains |
| `limit` | int | `5000` | Maximum URLs to return |
| `search` | string | none | Filter URLs by search query |
| `ignoreSitemap` | bool | `false` | Skip sitemap.xml discovery |

**Response:**

```json
{
  "success": true,
  "links": [
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog",
    "https://example.com/blog/post-1"
  ]
}
```
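
When `includeSubdomains` is left at its default, the returned `links` may mix the main host with its subdomains. A small client-side sketch for separating them afterwards follows. The function name `on_site` is hypothetical, and the suffix comparison is this example's notion of "subdomain", not necessarily the rule Essence applies.

```python
from urllib.parse import urlparse


def on_site(link, site_url, include_subdomains=True):
    """Return True if link is on site_url's host, optionally counting subdomains."""
    host = (urlparse(link).hostname or "").lower()
    root = (urlparse(site_url).hostname or "").lower()
    if not host or not root:
        return False
    if host == root:
        return True
    # "blog.example.com" ends with ".example.com"; "notexample.com" does not.
    return include_subdomains and host.endswith("." + root)
```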

### POST /api/v1/search

Web search via DuckDuckGo with optional scraping of result pages.

**Request:**

| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | *required* | Search query |
| `limit` | int | `10` | Number of search results |
| `scrapeResults` | bool | `false` | Scrape full content of each result |

**Response:**

```json
{
  "success": true,
  "data": [
    {
      "title": "Web Scraping in Rust",
      "url": "https://example.com/rust-scraping",
      "snippet": "A guide to web scraping with Rust..."
    }
  ]
}
```

### POST /api/v1/crawl/stream

Streaming variant of the crawl endpoint. Returns results via Server-Sent Events (SSE) as pages are crawled, rather than waiting for the full crawl to complete. Accepts the same parameters as `/api/v1/crawl`.
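
Consuming the stream means reading `data:` lines until a blank line closes each event. The following is a minimal parser sketch for the generic SSE wire format. It assumes each event carries a JSON payload in its `data:` field; the actual event schema is not documented here, so `iter_sse_json` may need adjusting.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event in a stream of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith(":"):  # comment line, often used as a keep-alive
            continue
        else:
            field, _, value = line.partition(":")
            if field == "data":
                # The SSE format allows one optional space after the colon.
                buffer.append(value[1:] if value.startswith(" ") else value)
    if buffer:  # lenient: also emit an event cut off by the end of the stream
        yield "\n".join(buffer)


def iter_sse_json(lines: Iterable[str]) -> Iterator[dict]:
    """Decode each event payload as JSON (an assumption about the payload format)."""
    for payload in iter_sse_data(lines):
        yield json.loads(payload)
```

The `lines` argument can be any iterable of decoded text lines, for instance the lines of an HTTP response body read incrementally, so each page can be handled as soon as its event arrives.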

### GET /health

Health check endpoint. Returns `200 OK` when the server is running. Used by Docker's `HEALTHCHECK`.

---

## MCP Server (AI Agent Integration)

Essence includes a built-in [Model Context Protocol](https://modelcontextprotocol.io/) server, allowing AI agents (Claude, Cursor, Windsurf, etc.) to use scrape, crawl, map, and search as tools directly.

The MCP endpoint is available at `http://localhost:8080/mcp` using the Streamable HTTP transport. Start the Essence server first, then connect your MCP client.

### Available MCP Tools

| Tool | Description |
|---|---|
| `scrape` | Fetch a single page and return Markdown content. Params: `url` (required), `formats`, `engine`, `timeout_ms` |
| `crawl` | Crawl a website up to a given depth and page limit. Params: `url` (required), `max_depth`, `limit`, `include_paths`, `exclude_paths` |
| `map` | Discover URLs on a site via sitemaps and link extraction. Params: `url` (required), `search`, `limit`, `include_subdomains` |
| `search` | Search the web and optionally scrape results. Params: `query` (required), `limit`, `scrape_results` |

> **Note:** The MCP server uses `timeout_ms` while the REST API uses `timeout`. Both accept milliseconds.
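
Comparing the tool table with the REST tables, the MCP parameters look like snake_case versions of the REST fields, plus the `timeout` to `timeout_ms` rename. A small sketch of that mapping is shown below; the helper name is hypothetical. It only renames keys, so REST fields that the tool table does not list (such as `onlyMainContent`) may not be accepted by the MCP tools.

```python
import re

# The one rename called out in the note above; other keys become plain snake_case.
REST_TO_MCP_OVERRIDES = {"timeout": "timeout_ms"}


def rest_to_mcp_args(rest_body):
    """Rename camelCase REST fields to the snake_case names used by the MCP tools."""
    args = {}
    for key, value in rest_body.items():
        # Insert "_" before every capital letter except a leading one, then lowercase.
        snake = re.sub(r"(?<!^)(?=[A-Z])", "_", key).lower()
        args[REST_TO_MCP_OVERRIDES.get(key, snake)] = value
    return args
```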

### Setup with Claude Code

Add to your project's `.mcp.json` (or global `~/.claude/.mcp.json`):

```json
{
  "mcpServers": {
    "essence": {
      "type": "http",
      "url": "http://localhost:8080/mcp"
    }
  }
}
```

Then use Essence tools directly in Claude Code conversations -- Claude will call `scrape`, `crawl`, `map`, and `search` as needed.

### Setup with Claude Desktop

Add to your Claude Desktop config (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):

```json
{
  "mcpServers": {
    "essence": {
      "type": "http",
      "url": "http://localhost:8080/mcp"
    }
  }
}
```

### Setup with Cursor / Windsurf / Other MCP Clients

Point your MCP client at `http://localhost:8080/mcp` using the HTTP transport. No API key required.

---

## Architecture

### Two-Tier Rendering

```
Request → HTTP Engine (fast, lightweight)
              ↓ content density too low?
          Browser Engine (Chromium CDP)
              ↓
          Markdown Output (clean, LLM-ready)
```

1. **HTTP Engine** (default) -- lightweight fetch via reqwest. Used when the HTML contains sufficient content density.
2. **Browser Engine** (fallback) -- full Chromium automation via CDP. Activated for SPAs, anti-bot pages, and JavaScript-rendered content.

Auto-detection triggers browser fallback based on:
- Content density analysis (too little text in HTML)
- Hydration markers (`__NEXT_DATA__`, `__NUXT__`, etc.)
- Meta-refresh redirects
- Anti-fetch response headers
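
The first three signals can be approximated in a few lines. The sketch below is written in Python for brevity and is not Essence's Rust implementation. The `min_density` threshold, the density formula (visible text length divided by HTML length), and the reason labels are all assumptions made for illustration. Header-based detection is left out because the relevant headers are not listed here.

```python
import re
from html.parser import HTMLParser

HYDRATION_MARKERS = ("__NEXT_DATA__", "__NUXT__")  # the two markers named above
META_REFRESH = re.compile(r"<meta[^>]+http-equiv=[\"']?refresh", re.IGNORECASE)


class _TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def text_density(html):
    """Ratio of visible text characters to total HTML characters."""
    if not html:
        return 0.0
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join("".join(collector.chunks).split())  # collapse whitespace
    return len(text) / len(html)


def browser_fallback_reasons(html, min_density=0.05):
    """Return the (hypothetical) reasons this HTML would warrant a browser render."""
    reasons = []
    if text_density(html) < min_density:
        reasons.append("low_content_density")
    if any(marker in html for marker in HYDRATION_MARKERS):
        reasons.append("hydration_marker")
    if META_REFRESH.search(html):
        reasons.append("meta_refresh")
    return reasons
```

An empty list means the HTTP result would be kept; any entry would send the URL to the browser engine.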

### Module Structure

```
backend/src/
  api/           # Endpoint handlers (scrape, crawl, map, search)
  engines/       # HTTP and Browser rendering engines
  format/        # Markdown and metadata output formatters
  crawler/       # URL frontier, robots.txt, rate limiting, sitemaps, pagination
  search/        # DuckDuckGo HTML search integration
  cache/         # In-memory caching (moka)
  rate_limit/    # Per-domain rate limiting (governor)
  utils/         # DNS cache, URL normalization, robots.txt parsing
  mcp.rs         # MCP server (Model Context Protocol)
  types.rs       # Request/response types
  error.rs       # Error types
  config.rs      # Configuration
  main.rs        # Server startup with graceful shutdown
```

---

## Configuration

Copy `backend/.env.example` to `backend/.env` and customize:

| Variable | Default | Description |
|---|---|---|
| `PORT` | `8080` | Server port |
| `RUST_LOG` | `essence=info` | Log level (`debug`, `info`, `warn`, `error`) |
| `BROWSER_HEADLESS` | `true` | Run Chromium in headless mode |
| `BROWSER_POOL_SIZE` | `5` | Number of browser instances in pool |
| `BROWSER_TIMEOUT_MS` | `30000` | Browser page timeout |
| `ENGINE_WATERFALL_ENABLED` | `true` | Enable HTTP-to-browser fallback |
| `ENGINE_WATERFALL_DELAY_MS` | `5000` | Delay before browser fallback |
| `CRAWL_RATE_LIMIT_PER_SEC` | `2` | Crawl rate limit per domain |
| `MAX_CONCURRENT_REQUESTS` | `10` | Max concurrent crawl requests |
| `MAX_PARALLEL_SCRAPES` | `5` | Parallel scrapes for search results |
| `MAX_REQUEST_SIZE_MB` | `1` | Maximum request body size |
| `CRAWL_MAX_DURATION_SEC` | `300` | Maximum crawl duration in seconds |
| `RETRY_MAX_ATTEMPTS` | `3` | Max retry attempts for failed requests |
| `RETRY_INITIAL_DELAY_MS` | `500` | Initial retry delay |
| `RETRY_MAX_DELAY_MS` | `30000` | Maximum retry delay |
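
To see how the three `RETRY_*` settings interact, the sketch below reads them with the defaults from the table and prints a delay schedule. Two assumptions are made that the table does not confirm: that the delay doubles after each failure, and that `RETRY_MAX_ATTEMPTS` counts retries rather than total attempts. The actual backoff strategy may differ.

```python
import os


def retry_delays_ms(env=None):
    """Return the delay before each retry, in milliseconds."""
    env = os.environ if env is None else env
    attempts = int(env.get("RETRY_MAX_ATTEMPTS", "3"))
    initial = int(env.get("RETRY_INITIAL_DELAY_MS", "500"))
    cap = int(env.get("RETRY_MAX_DELAY_MS", "30000"))
    # Assumed strategy: double the delay each time, never exceeding the cap.
    return [min(initial * 2 ** i, cap) for i in range(attempts)]


print(retry_delays_ms({}))  # defaults from the table: [500, 1000, 2000]
```

With the defaults the cap is never reached. It only comes into play from the seventh retry onward, where doubling would give 32000 ms.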

---

## Testing

```bash
cd backend

# Unit tests (377 tests, no network required)
cargo test --lib

# Lint
cargo clippy -- -D warnings

# Integration tests (requires network)
cargo test --test integration -- --ignored

# Browser engine tests (requires Chromium)
cargo test --test browser_engine_tests

# Compile all tests without running
cargo test --no-run
```

---

## Benchmarks

Essence is benchmarked head-to-head against [Firecrawl](https://firecrawl.dev) across 35 real-world URLs spanning 7 categories, evaluated by an LLM judge on 5 quality dimensions.

| Metric | Essence | Target | Status |
|---|---|---|---|
| **Quality Win Rate** (LLM Judge) | **91.2%** (31-2-1) | >= 70% | Exceeded |
| **Speed Win Rate** | **74%** (26-9-0) | >= 50% | Exceeded |
| Success Rate | 100% | 100% | Met |

### Per-Category Quality (LLM Judge)

| Category | Essence Wins | Win Rate |
|---|---|---|
| Structured (Wikipedia, APIs) | 5/5 | **100%** |
| News (BBC, Reuters, Ars) | 5/5 | **100%** |
| Reference (HN, arXiv, dev.to) | 5/5 | **100%** |
| Content (essays, blogs) | 4/5 | **80%** |
| Dynamic (React, Next.js, Vercel) | 4/5 | **80%** |
| Docs (Rust, Python, Go, MDN) | 3/5 | **60%** |
| E-Commerce (Newegg, eBay, IKEA) | 3/5 | **60%** |

See the [full benchmark methodology](https://github.com/ruchit-p/essence/blob/master/docs/BENCHMARKS.md) for detailed results.
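
The headline percentages can be reproduced from the records in the summary table, reading each record as wins-losses-ties (an assumption about the notation). Note that the quality record sums to 34 comparisons rather than 35, and 91.2% corresponds to 31 out of 34.

```python
def win_rate(wins, losses, ties):
    """Share of comparisons won, counting ties and losses in the denominator."""
    return wins / (wins + losses + ties)


print(f"quality: {win_rate(31, 2, 1):.1%}")  # quality: 91.2%
print(f"speed:   {win_rate(26, 9, 0):.1%}")  # speed:   74.3%
```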

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, code style, and PR process.

This repository includes a `CLAUDE.md` for AI-assisted development.

## License

[MIT](LICENSE)