servo-fetch-cli
A browser engine in a binary — fetch, render, and extract web content as Markdown, JSON, or screenshots. Powered by Servo.
For programmatic use in Rust, see the servo-fetch library crate.
Install
Pre-built binaries (recommended)
Cargo
Usage
Extract content
Batch fetch
Screenshots
JavaScript execution
CSS selector extraction
Visibility filtering
Hidden patterns (cookie banners, modals, aria-hidden, opacity:0, sr-only)
are stripped under the default moderate policy. Use strict to also drop
screen-reader-only content, or off to disable flag-based stripping (semantic
hides like [hidden] / modal dialogs always apply).
Structured extraction (schema)
Pull a declarative set of fields into JSON using CSS selectors — no LLM required. Define a schema once, reuse it across URLs:
Field type values: text, attribute, html, inner_html, nested_list. See servo_fetch::schema for the full reference.
Crawl a site
Discover URLs (sitemap)
SPA / dynamic content
MCP server
HTTP API server
See HTTP API server below for the endpoint reference.
Options
| Flag | Description |
|---|---|
| `--json` | Structured JSON output (NDJSON for multiple URLs) |
| `--screenshot <FILE>` | Save PNG screenshot |
| `--full-page` | Capture full scrollable page (requires --screenshot) |
| `--js <EXPR>` | Execute JavaScript and print result |
| `--selector <CSS>` | Extract specific section by CSS selector |
| `--raw html\|text` | Raw HTML or plain text output |
| `--schema <FILE>` | Extract structured JSON using a CSS-selector schema file |
| `-t, --timeout <SECS>` | Page load timeout in seconds (default: 30) |
| `--settle <MS>` | Extra wait after load event in ms (default: 0, max: 10000) |
| `--user-agent <UA>` | Override the User-Agent string |
| `-v, --verbose` | Increase log verbosity (-v info, -vv debug, -vvv trace) |
| `-q, --quiet` | Suppress all logs except errors |
JSON output
--json returns an object with these fields:
| Field | Type | Description |
|---|---|---|
| `title` | string | Page title |
| `content` | string | Raw HTML extracted by Readability |
| `text_content` | string | Readable text (Markdown) |
| `byline` | string | Author or byline (omitted if not detected) |
| `excerpt` | string | Short excerpt or description (omitted if not detected) |
| `lang` | string | Document language (omitted if not detected) |
| `url` | string | Canonical URL (omitted if not detected) |
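As a rough illustration of consuming this output, the sketch below parses NDJSON (one object per line, as produced for multiple URLs) and treats the fields marked "omitted if not detected" as optional. The `Page` class and `parse_ndjson` helper are hypothetical names, the sample records are made up, and it is an assumption that `title`, `content`, and `text_content` are always present.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """One --json record; the last four fields may be absent."""
    title: str
    content: str          # raw HTML extracted by Readability
    text_content: str     # readable text (Markdown)
    byline: Optional[str] = None
    excerpt: Optional[str] = None
    lang: Optional[str] = None
    url: Optional[str] = None


def parse_ndjson(stream: str) -> list[Page]:
    """Parse newline-delimited JSON, skipping blank lines."""
    pages = []
    for line in stream.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Keep only the documented fields so unknown keys do not raise.
        known = {k: record[k] for k in Page.__dataclass_fields__ if k in record}
        pages.append(Page(**known))
    return pages


# Made-up sample data standing in for real servo-fetch output.
sample = "\n".join([
    json.dumps({"title": "A", "content": "<p>a</p>", "text_content": "a", "lang": "en"}),
    json.dumps({"title": "B", "content": "<p>b</p>", "text_content": "b"}),
])
pages = parse_ndjson(sample)
```

Reading line by line keeps memory use flat when the output is piped from a large batch.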
Crawl subcommand
servo-fetch crawl <URL> follows same-site links breadth-first (BFS). It respects robots.txt (RFC 9309) and spaces requests at least 500 ms apart by default.
| Flag | Description |
|---|---|
| `--limit <N>` | Maximum pages to crawl (default: 50) |
| `--max-depth <N>` | Maximum link depth (default: 3) |
| `--include <GLOB>` | URL path patterns to include |
| `--exclude <GLOB>` | URL path patterns to exclude |
| `--json` | Output content as JSON per page |
| `--selector <CSS>` | Extract specific section per page |
| `--concurrency <N>` | Maximum parallel page fetches (default: 1; completion order when >1) |
| `--delay-ms <MS>` | Minimum dispatch interval in ms (default: 500; 0 disables rate limiting) |
| `--user-agent <UA>` | Override the User-Agent string |
| `-t, --timeout <SECS>` | Page load timeout in seconds per page (default: 30) |
| `--settle <MS>` | Extra wait after load event in ms per page (default: 0, max: 10000) |
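To illustrate how --limit and --max-depth bound a breadth-first crawl, here is a minimal sketch over an in-memory link graph. It is not servo-fetch's implementation: `bfs_crawl` and the `site` mapping are hypothetical, and robots.txt handling, rate limiting, and the same-site check are left out. Treating the seed as depth 0 is also an assumption.

```python
from collections import deque


def bfs_crawl(seed, links, limit=50, max_depth=3):
    """Return pages in BFS order, bounded by page count and link depth."""
    visited = [seed]
    seen = {seed}
    queue = deque([(seed, 0)])  # (url, depth); seed assumed to be depth 0
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth bound
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if len(visited) >= limit:
                return visited  # page budget exhausted
            seen.add(nxt)
            visited.append(nxt)
            queue.append((nxt, depth + 1))
    return visited


# Hypothetical site graph: each path maps to the paths it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/b"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}
order = bfs_crawl("/", site, limit=50, max_depth=2)
capped = bfs_crawl("/", site, limit=3, max_depth=2)
```

With `max_depth=2` the page `/a/1/x` (three links from the seed) is never reached, and with `limit=3` the crawl stops after the seed and its two direct links.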
Map subcommand
servo-fetch map <URL> discovers a site's URLs via sitemaps, without rendering any pages. If no sitemap exists, it falls back to HTML link extraction.
| Flag | Description |
|---|---|
| `--limit <N>` | Maximum URLs to return (default: 5000) |
| `--include <GLOB>` | URL path patterns to include |
| `--exclude <GLOB>` | URL path patterns to exclude |
| `--json` | Output as JSON array with lastmod metadata |
| `--user-agent <UA>` | Override the User-Agent string |
| `-t, --timeout <SECS>` | HTTP request timeout in seconds (default: 30) |
| `--no-fallback` | Skip HTML link extraction fallback |
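The interaction of --include, --exclude, and --limit can be pictured as a small filter over URL paths. This is a sketch only: it uses Python's `fnmatch`, whose `*` also matches `/`, which may differ from the glob dialect servo-fetch uses, and the rule that exclude wins over include is an assumption. `filter_urls` is a hypothetical helper and the URLs are made up.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def filter_urls(urls, include=(), exclude=(), limit=5000):
    """Keep URLs whose path matches an include pattern (if any are given)
    and no exclude pattern, stopping once `limit` URLs are kept."""
    kept = []
    for url in urls:
        if len(kept) >= limit:
            break
        path = urlsplit(url).path or "/"
        if include and not any(fnmatchcase(path, p) for p in include):
            continue
        if any(fnmatchcase(path, p) for p in exclude):
            continue  # assumed: exclude takes precedence over include
        kept.append(url)
    return kept


# Made-up URLs for illustration.
urls = [
    "https://example.com/",
    "https://example.com/docs/intro",
    "https://example.com/docs/private/keys",
    "https://example.com/blog/post",
]
docs = filter_urls(urls, include=["/docs/*"], exclude=["/docs/private/*"])
```

Matching is done on the path component only, in line with the flag descriptions ("URL path patterns").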
Logging
Diagnostic messages go to stderr; stdout is reserved for data output so pipes stay clean.
RUST_LOG="servo_fetch=debug"
RUST_LOG="servo_fetch=trace,servo=debug"
RUST_LOG uses tracing-subscriber's directive syntax and always wins over CLI flags.
Environment Variables
| Variable | Description |
|---|---|
| `SERVO_FETCH_USER_AGENT` | Default User-Agent string (overridden by --user-agent) |
| `SERVO_FETCH_NO_STDERR_FILTER` | Disable Apple OpenGL driver noise filter (debug use) |
| `RUST_LOG` | Fine-grained log filter (overrides -v/-q) |
MCP Server
Built-in Model Context Protocol server over stdio or Streamable HTTP.
Streamable HTTP: servo-fetch mcp --port 8080
Fetch a single URL:

| Parameter | Type | Description |
|---|---|---|
| `url` | string | URL to fetch (http/https only) |
| `format` | string? | markdown (default), json, html, text, or accessibility_tree |
| `max_length` | number? | Max characters to return (default 5000) |
| `start_index` | number? | Character offset for pagination (default 0) |
| `timeout` | number? | Page load timeout in seconds (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
| `selector` | string? | CSS selector to extract a specific section |

Fetch multiple URLs:

| Parameter | Type | Description |
|---|---|---|
| `urls` | string[] | URLs to fetch (http/https only, max 20) |
| `format` | string? | markdown (default) or json |
| `max_length` | number? | Max characters per URL result (default 5000) |
| `timeout` | number? | Page load timeout in seconds per URL (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
| `selector` | string? | CSS selector to extract a specific section |

Crawl from a starting URL:

| Parameter | Type | Description |
|---|---|---|
| `url` | string | Starting URL (http/https only) |
| `limit` | number? | Maximum pages to crawl (default 20, max 500) |
| `max_depth` | number? | Maximum link depth from seed (default 3, max 10) |
| `format` | string? | markdown (default) or json |
| `include_glob` | string[]? | URL path patterns to include |
| `exclude_glob` | string[]? | URL path patterns to exclude |
| `max_length` | number? | Max characters per page result (default 5000) |
| `timeout` | number? | Page load timeout in seconds per page (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
| `selector` | string? | CSS selector to extract a specific section per page |

Discover a site's URLs:

| Parameter | Type | Description |
|---|---|---|
| `url` | string | Site URL to discover pages for (http/https only) |
| `limit` | number? | Maximum URLs to return (default 5000) |
| `include_glob` | string[]? | URL path patterns to include |
| `exclude_glob` | string[]? | URL path patterns to exclude |

Capture a screenshot:

| Parameter | Type | Description |
|---|---|---|
| `url` | string | URL to capture (http/https only) |
| `full_page` | boolean? | Capture the full scrollable page (default false) |
| `timeout` | number? | Page load timeout in seconds (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |

Execute JavaScript in a loaded page:

| Parameter | Type | Description |
|---|---|---|
| `url` | string | URL to load before executing JS |
| `expression` | string | JavaScript expression to evaluate |
| `timeout` | number? | Page load timeout in seconds (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
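The max_length and start_index parameters of the single-URL fetch tool suggest a simple paging loop for long documents. The sketch below assumes each call returns the slice of text that starts at start_index and is at most max_length characters long, and that a short or empty slice signals the end; neither behavior is confirmed here. `call_fetch` is a hypothetical stand-in for whatever MCP client you use, and `fake_fetch` exists only to make the example runnable.

```python
def read_all(call_fetch, url, max_length=5000):
    """Collect a full document by advancing start_index chunk by chunk."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    chunks = []
    start_index = 0
    while True:
        chunk = call_fetch(url=url, max_length=max_length, start_index=start_index)
        chunks.append(chunk)
        if len(chunk) < max_length:
            break  # assumed end-of-document signal
        start_index += len(chunk)
    return "".join(chunks)


# Fake tool backed by an in-memory string, for demonstration only.
DOCUMENT = "x" * 12000 + "END"


def fake_fetch(url, max_length, start_index):
    return DOCUMENT[start_index:start_index + max_length]


text = read_all(fake_fetch, "https://example.com/long")
```

If the real tool appends a truncation notice or counts characters differently, the stop condition would need to change accordingly.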
HTTP API server
servo-fetch serve starts a REST API on the given host/port (default 127.0.0.1:3000). JSON request/response, binary PNG for screenshots.
| Flag | Description |
|---|---|
| `--host <HOST>` | Bind address (default 127.0.0.1) |
| `--port <PORT>` | TCP port (default 3000) |
Responses include x-request-id (auto-generated if the request does not supply one); use this for tracing in logs. Errors use a consistent {"error": "..."} JSON shape across all endpoints. Request bodies are capped at 1 MiB. SSRF protection (private/reserved address blocking) applies to every endpoint.
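For intuition about what private/reserved address blocking involves, the following sketch classifies IP literals with Python's `ipaddress` module. It is not servo-fetch's implementation, and the exact set of blocked ranges is an assumption; a real guard would also need to check every address a hostname resolves to. `is_blocked` is a hypothetical name.

```python
import ipaddress


def is_blocked(address: str) -> bool:
    """Return True for addresses a typical SSRF guard would refuse."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_private          # e.g. 10.0.0.0/8, 192.168.0.0/16
        or ip.is_loopback      # 127.0.0.0/8, ::1
        or ip.is_link_local    # 169.254.0.0/16 (cloud metadata services)
        or ip.is_reserved
        or ip.is_multicast
        or ip.is_unspecified   # 0.0.0.0, ::
    )


blocked = [a for a in ("127.0.0.1", "10.0.0.5", "169.254.169.254", "8.8.8.8") if is_blocked(a)]
```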
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness probe (`{"status":"ok"}`) |
| GET | `/version` | `{"name":"servo-fetch","version":"..."}` |
| POST | `/v1/fetch` | Fetch one URL; returns extracted content |
| POST | `/v1/batch_fetch` | Fetch up to 20 URLs in parallel |
| POST | `/v1/screenshot` | Capture a PNG; image/png body |
| POST | `/v1/execute_js` | Evaluate JavaScript in a loaded page |
| POST | `/v1/crawl` | BFS crawl starting from a URL |
| POST | `/v1/map` | Discover URLs via sitemaps (no rendering) |
Request and response shapes mirror the MCP tool parameters documented above.
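Since request bodies mirror the tool parameters, a call to POST /v1/fetch can be sketched with the standard library alone. The body fields come from the parameter tables; the `build_fetch_request` helper, the example request ID, the `Content-Type` header, and the assumption that the server listens on the default 127.0.0.1:3000 are illustrative. Sending the request (for example with `urllib.request.urlopen`) needs a running servo-fetch serve and is left out.

```python
import json
import urllib.request


def build_fetch_request(url, base="http://127.0.0.1:3000", request_id=None, **params):
    """Build (but do not send) a POST /v1/fetch request with a JSON body."""
    body = {"url": url, **params}
    headers = {"Content-Type": "application/json"}
    if request_id is not None:
        headers["x-request-id"] = request_id  # caller-supplied ID for log tracing
    return urllib.request.Request(
        base + "/v1/fetch",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_fetch_request(
    "https://example.com",
    request_id="demo-1",  # made-up ID
    format="markdown",
    max_length=5000,
)
```

Building the request separately from sending it makes the payload easy to inspect and keeps it well under the 1 MiB body cap.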
Examples
# Fetch → Markdown
# Screenshot → PNG
Docker
Multi-arch image (linux/amd64, linux/arm64) on GitHub Container Registry:
Override the default serve with any servo-fetch subcommand:
Minimum-privilege deployment:
The image runs as UID 1001 and includes a HEALTHCHECK against /health.
Image signing
Images are signed with cosign keyless via GitHub OIDC and ship with SLSA build provenance and an SPDX SBOM as OCI attestations.