servo-fetch-cli 0.11.2

A browser engine in a binary — fetch, render, and extract web content as Markdown, JSON, or screenshots. Powered by Servo.
# servo-fetch-cli

[![crates.io](https://img.shields.io/crates/v/servo-fetch-cli.svg)](https://crates.io/crates/servo-fetch-cli)

A browser engine in a binary — fetch, render, and extract web content as Markdown, JSON, or screenshots. Powered by [Servo](https://servo.org/).

For programmatic use in Rust, see the [`servo-fetch`](https://crates.io/crates/servo-fetch) library crate.

## Install

### Pre-built binaries (recommended)

```bash
curl -fsSL https://raw.githubusercontent.com/konippi/servo-fetch/main/install.sh | sh
```

### Cargo

```bash
cargo binstall servo-fetch-cli   # prebuilt binary via cargo-binstall
cargo install servo-fetch-cli    # build from source (requires Rust 1.86.0+)
```

## Usage

### Extract content

```bash
servo-fetch "https://example.com"              # Readable Markdown (default)
servo-fetch "https://example.com" --format json   # Structured JSON
servo-fetch "https://example.com" --format html   # Raw rendered HTML
servo-fetch "https://example.com" --format text   # Plain text (innerText)
```

### Single file output

```bash
servo-fetch "https://example.com" --output page.md   # Save to a single file (single URL only)
```

### Batch fetch

```bash
servo-fetch URL1 URL2 URL3                     # Parallel fetch, Markdown output
servo-fetch URL1 URL2 --format json            # Parallel fetch, NDJSON output
servo-fetch URL1 URL2 --output-dir ./out/      # One file per URL (auto-created)
```

### Screenshots

```bash
servo-fetch "https://example.com" --screenshot page.png
servo-fetch "https://example.com" --screenshot full.png --full-page
```

### JavaScript execution

```bash
servo-fetch "https://example.com" --js "document.title"
servo-fetch "https://example.com" --js "document.querySelectorAll('h2').length"
```

### CSS selector extraction

```bash
servo-fetch "https://example.com" --selector "article"
servo-fetch "https://example.com" --selector ".main-content" --format json
```

### Visibility filtering

Hidden patterns (cookie banners, modals, `aria-hidden`, `opacity:0`, sr-only)
are stripped under the default `moderate` policy. Use `strict` to also drop
screen-reader-only content, or `off` to disable flag-based stripping (semantic
hides like `[hidden]` / modal dialogs always apply).

```bash
servo-fetch "https://example.com" --visibility moderate   # default
servo-fetch "https://example.com" --visibility strict
servo-fetch "https://example.com" --visibility off
```
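
To see what a policy actually removes on a given page, one option is to fetch the page twice and diff the results. The sketch below is illustrative only: `visibility_diff` and the script-local `FETCH_CMD` variable are hypothetical names (`FETCH_CMD` is not read by servo-fetch itself), and only the `--visibility` values shown above are used.

```bash
# Script-local hook so the command can be swapped out; not a servo-fetch setting.
FETCH_CMD="${FETCH_CMD:-servo-fetch}"

# Print a diff between the `moderate` and `off` outputs for one URL.
# Exit status follows diff: 0 = identical, 1 = different, 2 = fetch/diff error.
visibility_diff() {
  url=$1
  tmp_dir=$(mktemp -d) || return 2
  if ! "$FETCH_CMD" "$url" --visibility moderate > "$tmp_dir/moderate.md" ||
     ! "$FETCH_CMD" "$url" --visibility off > "$tmp_dir/off.md"; then
    rm -rf "$tmp_dir"
    return 2
  fi
  status=0
  diff "$tmp_dir/moderate.md" "$tmp_dir/off.md" || status=$?
  rm -rf "$tmp_dir"
  return "$status"
}

# Usage: visibility_diff "https://example.com"
```

Lines prefixed with `>` in the diff appear only with `--visibility off`, so they are what the default policy stripped.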

### Structured extraction (schema)

Pull a declarative set of fields into JSON using CSS selectors — no LLM required. Define a schema once, reuse it across URLs:

```bash
servo-fetch "https://shop.example.com" --schema schema.json
servo-fetch URL1 URL2 URL3 --schema schema.json     # batch → NDJSON
```

```json
{
  "base_selector": ".product",
  "fields": [
    { "name": "title", "selector": "h2", "type": "text" },
    { "name": "price", "selector": ".price", "type": "text" },
    { "name": "url", "selector": "a", "type": "attribute", "attribute": "href" }
  ]
}
```

Field `type` values: `text`, `attribute`, `html`, `inner_html`, `nested_list`. See [`servo_fetch::schema`](https://docs.rs/servo-fetch/latest/servo_fetch/schema/) for the full reference.

### Crawl a site

```bash
servo-fetch crawl "https://docs.example.com" --limit 20
servo-fetch crawl "https://docs.example.com" --include "/docs/**" --exclude "/docs/archive/**"
servo-fetch crawl "https://docs.example.com" --format json --max-depth 5
```

### Discover URLs (sitemap)

```bash
servo-fetch map "https://example.com"
servo-fetch map "https://example.com" --limit 100 --include "/blog/**"
```

### SPA / dynamic content

```bash
servo-fetch "https://spa.example.com" --settle 3000       # Wait 3s after load for hydration
servo-fetch "https://spa.example.com" -t 60 --settle 5000 # 60s timeout + 5s settle
```
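
When the right settle time is not known in advance, a script can retry with progressively longer values until the output contains a string that only appears after hydration. This is a minimal sketch under assumptions: `fetch_until_contains`, the marker text, and the script-local `FETCH_CMD` variable are hypothetical, and the settle steps (0, 2000, 5000 ms) are arbitrary choices within the documented 10000 ms maximum.

```bash
# Script-local hook so the command can be swapped out; not a servo-fetch setting.
FETCH_CMD="${FETCH_CMD:-servo-fetch}"

# Fetch a URL with increasing --settle values until the output contains a marker.
fetch_until_contains() {
  url=$1
  marker=$2
  for ms in 0 2000 5000; do
    # A failed fetch is skipped and the next, longer settle value is tried.
    body=$("$FETCH_CMD" "$url" --settle "$ms") || continue
    case $body in
      *"$marker"*)
        printf '%s\n' "$body"
        return 0
        ;;
    esac
  done
  echo "marker not found with settle up to 5000 ms: $marker" >&2
  return 1
}

# Usage: fetch_until_contains "https://spa.example.com" "Dashboard" > page.md
```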

### MCP server

```bash
servo-fetch mcp                # stdio transport (for AI agents)
servo-fetch mcp --port 8080    # Streamable HTTP transport
```

### HTTP API server

```bash
servo-fetch serve                            # 127.0.0.1:3000
servo-fetch serve --host 0.0.0.0 --port 80   # expose to network
```

See [HTTP API server](#http-api-server-1) below for the endpoint reference.

### Healthcheck

```bash
servo-fetch healthcheck --port 3000   # exits 0 on 2xx /health, 1 otherwise
```

This is a self-contained `/health` probe, suitable for any orchestrator that runs a command and inspects its exit code.
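
Because the probe reports readiness through its exit code, a startup script can poll it before sending traffic. The following is a sketch, not part of the tool: `wait_until_healthy`, its attempt and delay defaults, and the script-local `FETCH_CMD` variable are hypothetical.

```bash
# Script-local hook so the command can be swapped out; not a servo-fetch setting.
FETCH_CMD="${FETCH_CMD:-servo-fetch}"

# Poll `healthcheck` until it exits 0 or the attempts run out.
# Arguments: port, attempts (default 10), delay in seconds (default 1).
wait_until_healthy() {
  port=$1
  attempts=${2:-10}
  delay=${3:-1}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$FETCH_CMD" healthcheck --port "$port"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "server on port $port not healthy after $attempts attempts" >&2
  return 1
}

# Usage: servo-fetch serve & wait_until_healthy 3000 && echo "ready"
```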

## Options

| Flag | Description |
| ---- | ----------- |
| `--format json\|html\|text` | Output format instead of the default Markdown: structured JSON (NDJSON for multiple URLs), raw rendered HTML, or plain text |
| `--screenshot <FILE>` | Save PNG screenshot |
| `--full-page` | Capture full scrollable page (requires `--screenshot`) |
| `--js <EXPR>` | Execute JavaScript and print result |
| `--selector <CSS>` | Extract specific section by CSS selector |
| `--schema <FILE>` | Extract structured JSON using a CSS-selector schema file |
| `-t, --timeout <SECS>` | Page load timeout in seconds (default: 30) |
| `--settle <MS>` | Extra wait after load event in ms (default: 0, max: 10000) |
| `--user-agent <UA>` | Override the User-Agent string |
| `-v, --verbose` | Increase log verbosity (`-v` info, `-vv` debug, `-vvv` trace) |
| `-q, --quiet` | Suppress all logs except errors |

### JSON output

`--format json` returns an object with these fields:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `title` | string | Page title |
| `content` | string | Raw HTML extracted by Readability |
| `text_content` | string | Readable text (Markdown) |
| `byline` | string | Author or byline (omitted if not detected) |
| `excerpt` | string | Short excerpt or description (omitted if not detected) |
| `lang` | string | Document language (omitted if not detected) |
| `url` | string | Canonical URL (omitted if not detected) |

### Crawl subcommand

`servo-fetch crawl <URL>` follows same-site links breadth-first (BFS). It respects `robots.txt` (RFC 9309) and, by default, waits at least 500 ms between page dispatches (see `--delay-ms`).

| Flag | Description |
| ---- | ----------- |
| `--limit <N>` | Maximum pages to crawl (default: 50) |
| `--max-depth <N>` | Maximum link depth (default: 3) |
| `--include <GLOB>` | URL path patterns to include |
| `--exclude <GLOB>` | URL path patterns to exclude |
| `--format json` | Output content as JSON per page |
| `--selector <CSS>` | Extract specific section per page |
| `--concurrency <N>` | Maximum parallel page fetches (default: 1; with values above 1, pages are emitted in completion order) |
| `--delay-ms <MS>` | Minimum dispatch interval in ms (default: 500; 0 disables rate limiting) |
| `--user-agent <UA>` | Override the User-Agent string |
| `-t, --timeout <SECS>` | Page load timeout in seconds per page (default: 30) |
| `--settle <MS>` | Extra wait after load event in ms per page (default: 0, max: 10000) |
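
Since `--delay-ms` sets a minimum interval between dispatches, a crawl has a rough floor on its duration that raising `--concurrency` should not remove. The arithmetic can be sketched as follows, assuming dispatches are spaced exactly `--delay-ms` apart and ignoring page load time; `crawl_min_ms` is a hypothetical helper, and its defaults mirror the table above.

```bash
# Lower bound on crawl duration in milliseconds: the first page is dispatched
# immediately, and each later page waits at least delay_ms after the previous one.
# Arguments: page limit (default 50), delay in ms (default 500).
crawl_min_ms() {
  limit=${1:-50}
  delay_ms=${2:-500}
  if [ "$limit" -le 1 ]; then
    echo 0
    return 0
  fi
  echo $(( (limit - 1) * delay_ms ))
}

# Usage: crawl_min_ms 50 500   # (50 - 1) * 500
```

With the defaults this gives 24500 ms, so a full 50-page crawl takes at least about 24.5 seconds before any rendering time is counted.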

### Map subcommand

`servo-fetch map <URL>` discovers URLs on a site via its sitemaps, without rendering any pages, up to `--limit`. It falls back to HTML link extraction if no sitemap exists (disable this with `--no-fallback`).

| Flag | Description |
| ---- | ----------- |
| `--limit <N>` | Maximum URLs to return (default: 5000) |
| `--include <GLOB>` | URL path patterns to include |
| `--exclude <GLOB>` | URL path patterns to exclude |
| `--format json` | Output as JSON array with lastmod metadata |
| `--user-agent <UA>` | Override the User-Agent string |
| `-t, --timeout <SECS>` | HTTP request timeout in seconds (default: 30) |
| `--no-fallback` | Skip HTML link extraction fallback |
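
A common follow-up is to feed the discovered URLs into a batch fetch. The sketch below reads URLs from stdin and fetches them in fixed-size groups. It assumes that `map` prints one URL per line in its default output format, which this README does not state explicitly; `fetch_in_chunks` and the script-local `FETCH_CMD` variable are hypothetical.

```bash
# Script-local hook so the command can be swapped out; not a servo-fetch setting.
FETCH_CMD="${FETCH_CMD:-servo-fetch}"

# Read URLs from stdin and fetch them in groups of `chunk` into `out_dir`.
fetch_in_chunks() {
  chunk=$1
  out_dir=$2
  set --                                  # reuse positional parameters as the URL buffer
  while IFS= read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue             # skip blank lines
    set -- "$@" "$url"
    if [ "$#" -ge "$chunk" ]; then
      # stdin is redirected so the fetch cannot consume the remaining URL list
      "$FETCH_CMD" "$@" --output-dir "$out_dir" < /dev/null || return 1
      set --
    fi
  done
  if [ "$#" -gt 0 ]; then                 # flush the final partial group
    "$FETCH_CMD" "$@" --output-dir "$out_dir" < /dev/null || return 1
  fi
}

# Usage:
#   servo-fetch map "https://example.com" --limit 100 --include "/blog/**" |
#     fetch_in_chunks 20 ./out/
```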

## Logging

Diagnostic messages go to stderr; stdout is reserved for data output so pipes stay clean.

```bash
servo-fetch -v "https://example.com"                       # info and above
servo-fetch -vv "https://example.com"                      # debug
servo-fetch -vvv "https://example.com"                     # trace
servo-fetch -q "https://example.com"                       # errors only
RUST_LOG="servo_fetch=debug" servo-fetch "https://..."     # fine-grained override
RUST_LOG="servo_fetch=trace,servo=debug" servo-fetch "..." # include Servo internals
```

`RUST_LOG` uses [`tracing-subscriber`'s directive syntax](https://docs.rs/tracing-subscriber/latest/tracing_subscriber/filter/struct.EnvFilter.html) and always wins over CLI flags.
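
Because data and diagnostics use separate streams, a script can capture both without mixing them. A minimal sketch, where `fetch_with_log`, the file names, and the script-local `FETCH_CMD` variable are hypothetical:

```bash
# Script-local hook so the command can be swapped out; not a servo-fetch setting.
FETCH_CMD="${FETCH_CMD:-servo-fetch}"

# Save page content (stdout) and debug logs (stderr) to separate files.
fetch_with_log() {
  url=$1
  out_file=$2
  log_file=$3
  "$FETCH_CMD" -vv "$url" > "$out_file" 2> "$log_file"
}

# Usage: fetch_with_log "https://example.com" page.md fetch.log
```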

## Environment Variables

| Variable | Description |
| -------- | ----------- |
| `SERVO_FETCH_USER_AGENT` | Default User-Agent string (overridden by `--user-agent`) |
| `SERVO_FETCH_NO_STDERR_FILTER` | Disable Apple OpenGL driver noise filter (debug use) |
| `RUST_LOG` | Fine-grained log filter (overrides `-v`/`-q`) |

## MCP Server

Built-in [Model Context Protocol](https://modelcontextprotocol.io/) server over stdio or Streamable HTTP.

```json
{
  "mcpServers": {
    "servo-fetch": {
      "command": "servo-fetch",
      "args": ["mcp"]
    }
  }
}
```

Streamable HTTP: `servo-fetch mcp --port 8080`

<details>
<summary><b>fetch</b> — extract readable content from a URL</summary>

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `url` | string | URL to fetch (http/https only) |
| `format` | string? | `markdown` (default), `json`, `html`, `text`, or `accessibility_tree` |
| `max_length` | number? | Max characters to return (default 5000) |
| `start_index` | number? | Character offset for pagination (default 0) |
| `timeout` | number? | Page load timeout in seconds (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
| `selector` | string? | CSS selector to extract a specific section |

</details>

<details>
<summary><b>batch_fetch</b> — fetch multiple URLs in parallel</summary>

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `urls` | string[] | URLs to fetch (http/https only, max 20) |
| `format` | string? | `markdown` (default) or `json` |
| `max_length` | number? | Max characters per URL result (default 5000) |
| `timeout` | number? | Page load timeout in seconds per URL (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
| `selector` | string? | CSS selector to extract a specific section |

</details>

<details>
<summary><b>crawl</b> — crawl a website by following links</summary>

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `url` | string | Starting URL (http/https only) |
| `limit` | number? | Maximum pages to crawl (default 20, max 500) |
| `max_depth` | number? | Maximum link depth from seed (default 3, max 10) |
| `format` | string? | `markdown` (default) or `json` |
| `include_glob` | string[]? | URL path patterns to include |
| `exclude_glob` | string[]? | URL path patterns to exclude |
| `max_length` | number? | Max characters per page result (default 5000) |
| `timeout` | number? | Page load timeout in seconds per page (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |
| `selector` | string? | CSS selector to extract a specific section per page |

</details>

<details>
<summary><b>map</b> — discover URLs via sitemaps without rendering</summary>

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `url` | string | Site URL to discover pages for (http/https only) |
| `limit` | number? | Maximum URLs to return (default 5000) |
| `include_glob` | string[]? | URL path patterns to include |
| `exclude_glob` | string[]? | URL path patterns to exclude |

</details>

<details>
<summary><b>screenshot</b> — capture a PNG screenshot (no GPU required)</summary>

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `url` | string | URL to capture (http/https only) |
| `full_page` | boolean? | Capture the full scrollable page (default false) |
| `timeout` | number? | Page load timeout in seconds (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |

</details>

<details>
<summary><b>execute_js</b> — evaluate JavaScript in a loaded page</summary>

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `url` | string | URL to load before executing JS |
| `expression` | string | JavaScript expression to evaluate |
| `timeout` | number? | Page load timeout in seconds (default 30) |
| `settle_ms` | number? | Extra wait in ms after load event (default 0, max 10000) |

</details>

## HTTP API server

`servo-fetch serve` starts a REST API on the given host and port (default `127.0.0.1:3000`). Requests and responses are JSON, except for screenshots, which are returned as a binary PNG body.

| Flag | Description |
| ---- | ----------- |
| `--host <HOST>` | Bind address (default `127.0.0.1`) |
| `--port <PORT>` | TCP port (default `3000`) |

Responses include `x-request-id` (auto-generated if the request does not supply one); use this for tracing in logs. Errors use a consistent `{"error": "..."}` JSON shape across all endpoints. Request bodies are capped at 1 MiB. SSRF protection (private/reserved address blocking) applies to every endpoint.

### Endpoints

| Method | Path | Description |
| ------ | ---- | ----------- |
| `GET` | `/health` | Liveness probe (`{"status":"ok"}`) |
| `GET` | `/version` | `{"name":"servo-fetch","version":"..."}` |
| `POST` | `/v1/fetch` | Fetch one URL; returns extracted content |
| `POST` | `/v1/batch_fetch` | Fetch up to 20 URLs in parallel |
| `POST` | `/v1/screenshot` | Capture a PNG; `image/png` body |
| `POST` | `/v1/execute_js` | Evaluate JavaScript in a loaded page |
| `POST` | `/v1/crawl` | BFS crawl starting from a URL |
| `POST` | `/v1/map` | Discover URLs via sitemaps (no rendering) |

Request and response shapes mirror the MCP tool parameters documented above.

### Examples

```bash
# Fetch → Markdown
curl -X POST http://127.0.0.1:3000/v1/fetch \
  -H 'content-type: application/json' \
  -d '{"url":"https://example.com"}'

# Screenshot → PNG
curl -X POST http://127.0.0.1:3000/v1/screenshot \
  -H 'content-type: application/json' \
  -d '{"url":"https://example.com","full_page":true}' \
  -o page.png
```
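
For `/v1/batch_fetch`, the request body has to carry a list of URLs. Assuming the body mirrors the `batch_fetch` tool's `urls` parameter and its limit of 20, a small helper can build it from shell arguments. `batch_body` is a hypothetical name, and the helper rejects URLs containing quotes or backslashes rather than attempting JSON escaping.

```bash
# Build a {"urls":[...]} JSON body from 1 to 20 URL arguments.
batch_body() {
  if [ "$#" -lt 1 ] || [ "$#" -gt 20 ]; then
    echo "batch_body: expected 1 to 20 URLs, got $#" >&2
    return 1
  fi
  body='{"urls":['
  sep=''
  for url in "$@"; do
    case $url in
      *'"'*|*'\'*)
        # No JSON escaping is attempted in this sketch.
        echo "batch_body: unsupported character in URL: $url" >&2
        return 1
        ;;
    esac
    body="$body$sep\"$url\""
    sep=','
  done
  printf '%s\n' "$body]}"
}

# Usage:
#   curl -X POST http://127.0.0.1:3000/v1/batch_fetch \
#     -H 'content-type: application/json' \
#     -d "$(batch_body "https://example.com" "https://example.org")"
```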

## Docker

Multi-arch image (`linux/amd64`, `linux/arm64`) on GitHub Container Registry:

```bash
docker run --rm -p 3000:3000 ghcr.io/konippi/servo-fetch:latest
```

Override the default `serve` command by passing any other `servo-fetch` arguments:

```bash
docker run --rm ghcr.io/konippi/servo-fetch:latest https://example.com
docker run --rm ghcr.io/konippi/servo-fetch:latest --version
```

Minimum-privilege deployment:

```bash
docker run --rm -p 3000:3000 \
  --read-only --tmpfs /tmp \
  --cap-drop=ALL \
  --security-opt=no-new-privileges \
  ghcr.io/konippi/servo-fetch:latest
```

The image runs as UID 1001 and includes a `HEALTHCHECK` against `/health`.

### Image signing

Images are signed with [cosign](https://github.com/sigstore/cosign) in keyless mode via GitHub OIDC, and ship with [SLSA build provenance](https://slsa.dev/) and an [SPDX SBOM](https://spdx.dev/) as OCI attestations.

```bash
cosign verify ghcr.io/konippi/servo-fetch:latest \
  --certificate-identity-regexp '^https://github.com/konippi/servo-fetch/' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com
```